Photo by Christine Roy on Unsplash

This project walks through the necessary exploratory data analysis and the workflow for building a categorical salary predictor based on real-world data scraped from Glassdoor.

Essentially, my goal is to gain a better understanding of the data science job market, so it will be fun ;)

Part I: Scrape Data from Glassdoor with Selenium

In terms of data scraping tools, I found Selenium useful and straightforward to understand. There is actually an article on TDS that specifically explains how to scrape Glassdoor data with Selenium. The article can be found here:

However, since the original article was written in 2019, I…


A tutorial showing how to deploy spaCy on a large scale.

Photo by Thomas Evans on Unsplash

Background:

spaCy is one of my favourite NLP libraries, and I have been using it to perform a lot of Named Entity Recognition (NER) tasks. Generally, we first load a pre-trained spaCy model for a specific language and then fine-tune it on our training dataset. The training can be done offline on a local computer, and we can even test the performance of the fine-tuned model by hosting it locally through Flask / Streamlit.

Although I have found many great tutorials on deploying a spaCy model locally with Flask / Streamlit, there are not many tutorials on how to deploy…


Photo by Markus Spiske on Unsplash

In this article, I will walk you through my thoughts on how to build a fuzzy search algorithm. A very practical use case is finding alternative names for a brand: given ‘Amazon’, we want the algorithm to return strings such as ‘AMZ’, ‘AMZN’ or ‘AMZN MKTP’.

The article follows this outline:

  • Fuzzy search with FuzzyWuzzy
  • Fuzzy search with HMNI
  • Fuzzy search with an integrated algorithm
  • Return an alternative names table

Search with FuzzyWuzzy

FuzzyWuzzy is a great Python library that can be used to complete a fuzzy search job. Essentially…
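To make the brand-name use case concrete, here is a minimal sketch of fuzzy ranking. It is only an illustration and not the approach from the article itself: it swaps in `difflib.SequenceMatcher` from the Python standard library instead of FuzzyWuzzy, and the candidate strings and the helper names `similarity` and `rank_candidates` are made up for this example.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity in [0, 1] based on matching character blocks."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def rank_candidates(query, candidates):
    """Return (candidate, score) pairs sorted from most to least similar."""
    scored = [(name, similarity(query, name)) for name in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical merchant strings, made up for illustration.
candidates = ["AMZ", "AMZN", "AMZN MKTP", "Walmart", "Starbucks"]
ranked = rank_candidates("Amazon", candidates)

for name, score in ranked:
    print(f"{name:10s} {score:.2f}")
```

In this toy list the Amazon-like abbreviations rank above the unrelated brands, but the gap between them can be small, so a plain character-based score is probably not enough on its own for real transaction data.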


Photo by Mykola Makhlai on Unsplash

Problem Statement

Cleaning merchant names can be quite a challenging problem. Because different banks provide transaction data of different quality, there is no mature, standard way to clean the data. Commonly, merchant name cleaning is framed as a Named Entity Recognition (NER) task and solved in a similar way to an entity extraction problem.

For a FinTech company, merchant name cleaning is an important step: developers rely on the cleaned merchant names, extracted from originally messy transaction data, to categorize transactions properly and deliver a better customer experience in managing personal finance.

I found…


Photo by Tina Vanhove on Unsplash

Finally, we are in year 2021 🎉

It's a new chapter of life 🐣

As a data scientist, I wanted to use this opportunity to put together a list of interesting datasets that I found on Kaggle in 2021. I also hope this list will be useful to people who are looking for data science projects to build their own portfolio.

Motivation

After taking many different pathways trying to learn data science, the most effective one I have found so far is to work on projects with real datasets. …


In this project, I generated word clouds from Quantum Physics articles published on arXiv.org from 1994 to 2009.

The original dataset can be found on Kaggle:

Before introducing my work, I would also like to recommend some related readings on Medium that I learned a lot from.

Dataset Overview

The full dataset contains 6 columns, covering each article’s:

  • Title
  • Abstract
  • Categories
  • Date
  • Id
  • Doi


Photo by Chris Liverani on Unsplash

This project was originally my Udacity Machine Learning Engineer Nanodegree capstone project.

I found the dataset on Kaggle, linked here:

Project Overview

I am very proud to have completed this project because it challenged my skills not only in Machine Learning Engineering but also in Data Engineering and Software Engineering. I learned how to use the Streamlit library in Python to build my whole ML web app. On the web interface, you start by choosing your ML model type, then adjust the hyperparameters of the model, and finally select your evaluation metrics.


Photo by Matt Duncan on Unsplash

I easily get tired of searching for Python dataset manipulation commands on Stack Overflow.

In fact, there are only a handful of common commands that we encounter often.

Therefore, to save search time, I put together this summary of some very common and useful dataset manipulation commands in Python, so you don’t have to search for them again and again… ;)

Drop Duplicates

df.drop_duplicates(subset=['A', 'B'], keep='last', inplace=True)

Parameters:

subset: column label or sequence of labels, optional

  • Only consider certain columns for identifying duplicates, by default use all of the columns.

keep: {‘first’, ‘last’, False}, default ‘first’

  • Determines which duplicates (if any) to keep…
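As a quick illustration of how `subset` and `keep` interact, here is a small sketch on a toy DataFrame. The column names and values are made up for this example; only the `drop_duplicates` call itself comes from the summary above. The sketch returns new DataFrames instead of using `inplace=True`, so the three variants can be compared side by side.

```python
import pandas as pd

# Toy data, made up for illustration: rows 0 and 1 share the same (A, B) pair.
df = pd.DataFrame(
    {
        "A": [1, 1, 2, 2],
        "B": ["x", "x", "y", "z"],
        "C": [10, 20, 30, 40],
    }
)

# Keep the first row of each duplicated (A, B) pair.
kept_first = df.drop_duplicates(subset=["A", "B"], keep="first")

# Keep the last row of each duplicated (A, B) pair.
kept_last = df.drop_duplicates(subset=["A", "B"], keep="last")

# Drop every row whose (A, B) pair appears more than once.
kept_none = df.drop_duplicates(subset=["A", "B"], keep=False)

print(kept_first["C"].tolist())
print(kept_last["C"].tolist())
print(kept_none["C"].tolist())
```

Column C is ignored when identifying duplicates because it is not listed in `subset`; it only shows which of the duplicated rows survived.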


Photo by Gema Saputera on Unsplash

This blog post is inspired by the Udacity Data Scientist Nanodegree Capstone Project. One of the project choices was the Starbucks Capstone Challenge.

Project Definition

The project is based on data that simulates how people make purchasing decisions and how those decisions are influenced by promotional offers.

Each person in the simulation has some hidden traits that influence their purchasing patterns and are associated with their observable traits. People produce various events, including receiving offers, opening offers, and making purchases.

As a simplification, there are no explicit products to track. Only the amounts of each transaction or offer are recorded.

There are…


Photo by Edwin Andrade on Unsplash

This project is simply an EDA process for the dataset.

There are definitely many more ways to explore the reviews further, such as implementing NLP models.

The link to the dataset:

https://www.kaggle.com/imuhammad/course-reviews-on-coursera

As an active Coursera learner, when I found this dataset on Kaggle, I was curious about three main questions regarding Coursera:

1 — What are the top rated courses? / What are the top reviewed courses?

2 — How much does a course rating vary over a specific year?

3 — How much does a course’s number of reviews vary…

Gakki Cheng

Data Scientist 👨‍💻 | Article Writer 🍡 | LinkedIn: https://www.linkedin.com/in/cheng-zhang-carson/
