Photo by Markus Spiske on Unsplash

In this article I will walk you through how I approach building a fuzzy search algorithm. A very practical use case is finding alternative names for a brand: given ‘Amazon’, we want the algorithm to return strings such as ‘AMZ’, ‘AMZN’ or ‘AMZN MKTP’.

The article follows this outline:

  • Fuzzy search with FuzzyWuzzy
  • Fuzzy search with HMNI
  • Fuzzy search with an integrated algorithm
  • Return an alternative names table

Search with FuzzyWuzzy

FuzzyWuzzy is a great Python library that can be used for fuzzy search. Essentially…
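To make the idea of fuzzy search concrete, here is a minimal sketch that ranks candidate strings by similarity to a query. It uses `difflib.SequenceMatcher` from the Python standard library as a stand-in, not FuzzyWuzzy itself. The helper names `similarity` and `fuzzy_search`, the threshold of 0.5, and the sample strings are all assumptions made for illustration.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a case-insensitive similarity score between 0.0 and 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def fuzzy_search(query, candidates, threshold=0.5):
    """Return (candidate, score) pairs scoring at least `threshold`, best first."""
    scored = [(c, similarity(query, c)) for c in candidates]
    matches = [(c, s) for c, s in scored if s >= threshold]
    return sorted(matches, key=lambda pair: pair[1], reverse=True)


# Hypothetical transaction strings, chosen only for illustration.
names = ["AMZ", "AMZN", "AMZN MKTP", "Starbucks"]
results = fuzzy_search("Amazon", names)

for name, score in results:
    print(f"{name}: {score:.2f}")
```

The threshold is the main design choice here: a lower value returns more alternative names but also more false matches, so it is worth tuning on real data.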


Photo by Mykola Makhlai on Unsplash

Problem Statement

Cleaning merchant names can be quite a challenging problem. Because different banks provide transaction data of different quality, there is no mature, standard way to clean it. Commonly, merchant name cleaning is framed as a Named Entity Recognition (NER) task and solved in a similar way to an entity extraction problem.

For a FinTech company, merchant name cleaning is an important step: developers rely on the cleaned merchant names, extracted from originally messy transaction data, to categorize transactions properly and deliver a better personal finance experience to customers.

I found…


Photo by Tina Vanhove on Unsplash

Finally, we are in year 2021 🎉

It's a new chapter of life 🐣

For me, as a data scientist, I wanted to use this opportunity to summarize a list of interesting datasets that I found on Kaggle in 2021. I also hope that this list can be useful to the people who are looking for data science projects to build their own portfolio.

Motivation

After taking many different pathways trying to learn data science, the most effective one I found so far is to work on projects from real datasets. …


In this project, I generated word clouds from Quantum Physics articles published on arXiv.org from 1994 to 2009.

The original dataset can be found on Kaggle:

Before introducing my work, I would also like to recommend some related readings on Medium that I learned a lot from.

Dataset Overview

The full dataset contains 6 columns, covering each article’s:

  • Title
  • Abstract
  • Categories
  • Date
  • Id
  • Doi


Photo by Chris Liverani on Unsplash

This project was originally my Udacity Machine Learning Engineer Nanodegree capstone project.

I found the dataset on Kaggle:

Project Overview

I am very proud to have completed this project because it challenged my skills not only in Machine Learning Engineering but also in Data Engineering and Software Engineering. I learned how to use the Streamlit library in Python to build the whole ML web app. On the web interface, you start by choosing your ML model type, then adjust the model's hyperparameters, and finally select your evaluation metrics.


Photo by Matt Duncan on Unsplash

I easily get tired of searching Stack Overflow for Python dataset manipulation commands.

In fact, there are only a handful of commands that we encounter often.

So, to save search time, I made this summary of some very common and useful dataset manipulation commands in Python. That way you don’t have to search for them again and again… ;)

Drop Duplicates

df.drop_duplicates(subset=['A', 'B'], keep='last', inplace=True)

Parameters:

subset: column label or sequence of labels, optional

  • Only consider certain columns for identifying duplicates, by default use all of the columns.

keep: {‘first’, ‘last’, False}, default ‘first’

  • Determines which duplicates (if any) to keep…
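As a minimal sketch of how `subset` and `keep` interact, the example below runs the same call on a small made-up DataFrame. The column values are assumptions chosen only for illustration, and the call returns a new DataFrame instead of using `inplace=True` so that the original stays available for comparison.

```python
import pandas as pd

# Hypothetical data: rows 0 and 1 share the same values in columns A and B.
df = pd.DataFrame(
    {
        "A": [1, 1, 2],
        "B": ["x", "x", "y"],
        "C": [10, 20, 30],
    }
)

# Only A and B are compared; of each duplicate group, the last row is kept.
deduped = df.drop_duplicates(subset=["A", "B"], keep="last")

# keep=False drops every row that has a duplicate in A and B.
unique_only = df.drop_duplicates(subset=["A", "B"], keep=False)

print(deduped)
print(unique_only)
```

Note that the original row labels are preserved in the result; chaining `.reset_index(drop=True)` renumbers them if a clean index is needed.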


Photo by Gema Saputera on Unsplash

This blog post is inspired by the Udacity Data Scientist Nanodegree Capstone Project. One of the project choices was the Starbucks Capstone Challenge.

Project Definition

The project is based on data that simulates how people make purchasing decisions and how those decisions are influenced by promotional offers.

Each person in the simulation has some hidden traits that influence their purchasing patterns and are associated with their observable traits. People produce various events, including receiving offers, opening offers, and making purchases.

As a simplification, there are no explicit products to track. Only the amounts of each transaction or offer are recorded.

There are…


Photo by Edwin Andrade on Unsplash

This project is simply an exploratory data analysis (EDA) of the dataset.

There are definitely many more ways to explore the reviews further, such as implementing NLP models.

The link to the dataset:

https://www.kaggle.com/imuhammad/course-reviews-on-coursera

As an active Coursera learner, when I found this dataset on Kaggle, I was curious about three main questions about Coursera:

1 — What are the top rated courses? / What are the top reviewed courses?

2 — How much does a course rating vary over a specific year?

3 — How much does a course’s number of reviews vary…


Photo by Iva Rajović on Unsplash

Hello!

Today I would like to share some of the insights I found after analyzing the dataset from the Stack Overflow Annual Developer Survey 2020.

The link to the survey: https://insights.stackoverflow.com/survey

The steps are simple and easy to follow!

Our end goal is to build a comparative graph. A graph is often the best way to communicate with an audience that is not very familiar with the original dataset.

Now, let’s begin!

First, let’s see what information is included in the dataset. If you have downloaded the zip file through the link, you will find two CSV files and one…


A little introduction about myself and this project: I am a university graduate who majored in mathematical physics, and after graduation I joined a data science diploma program, hoping to start my professional career as a data scientist. This particle identification project was also my capstone project for that diploma program. While searching for a job, I now have more time to learn about data science and keep refining my skills. …

Gakki Cheng

Data Scientist 👨‍💻 | Article Writer 🍡 | LinkedIn: https://www.linkedin.com/in/cheng-zhang-carson/
