Discount Offer Strategy Recommendation on a Real World Starbucks Dataset

Photo by Gema Saputera on Unsplash

This blog post is inspired by the Udacity Data Scientist Nanodegree Capstone Project. One of the project choices was the Starbuck’s Capstone Challenge.

Project Definition

The project is based on the data simulates how people make purchasing decisions and how those decisions are influenced by promotional offers.

Each person in the simulation has some hidden traits that influence their purchasing patterns and are associated with their observable traits. People produce various events, including receiving offers, opening offers, and making purchases.

As a simplification, there are no explicit products to track. Only the amounts of each transaction or offer are recorded.

There are three types of offers that can be sent: buy-one-get-one (BOGO), discount, and informational. In a BOGO offer, a user needs to spend a certain amount to get a reward equal to that threshold amount. In a discount, a user gains a reward equal to a fraction of the amount spent. In an informational offer, there is no reward, but neither is there a requisite amount that the user is expected to spend. Offers can be delivered via multiple channels.

The basic task is to use the data to identify which groups of people are most responsive to each type of offer, and how best to present each type of offer.

Problem Statement

We are given three datasets:

  • profile.json (17000 users x 5 fields)
  • portfolio.json — Offers sent during 30-day test period (10 offers x 6 fields)
  • transcript.json — Event log (306648 events x 4 fields)

Through these datasets, we have the opportunity to zoom in and have a detailed view of customer’s purchasing pattern and each unique type of offer’s performance. Eventually, we should be able to give our own suggestion to modify current offer sending strategy in terms of reducing company loss and best locating target customer group.


Since this project doesn’t involve machine learning, we use an alternative way to evaluate our data analysis performance.

  • Loss Analysis to measure company’s total loss after sending each type of offer.
  • Offer Difficulty Analysis to measure each offer’s performance on customers by comparing the ratio of complete_with/without_view on categories of different difficulties.
  • Offer Duration Analysis to measure each offer’s performance on customers by comparing the ratio of complete_with/without_view on categories of different durations.
  • Age Segmentation Analysis to measure customer’s purchase pattern within different age groups.
  • Income Segmentation Analysis to measure customer’s purchase pattern within different income groups.
  • Gender Segmentation Analysis to measure customer’s purchase pattern difference between females and males.

Data Exploration & Data Preprocessing

Since there is no data modelling involved in this project, our metrics will be focused on data cleaning & feature engineering step.

  • portfolio cleaning: extract channel in the list of channels and create a more straightforward dummy variable representation.
  • profile cleaning: fill nans with median & mode values; create a new column calculating the membership length for each user.
  • transcript cleaning: extract event types from value and expand into dummy variables; keep offer_id as a new column feature; fill nans with 0; covert time into days; sorted by person_id & time for next step; drop duplicates.

portfolio cleaning

# id (string) - offer id
# offer_type (string) - type of offer ie BOGO, discount, informational
# difficulty (int) - minimum required spend to complete an offer
# reward (int) - reward given for completing an offer
# duration (int) - time for offer to be open, in days

cleaned portfolio

profile cleaning

# age (int) - age of the customer
# became_member_on (int) - date when customer created an app account
# gender (str) - gender of the customer (note some entries contain 'O' for other rather than M or F)
# id (str) - customer id
# income (float) - customer's income
# difference_days - membership length
cleaned profile

transcript cleaning

# event (str) - record description (ie transaction, offer received, offer viewed, etc.)
# person (str) - customer id
# time (int) - time in hours since start of test. The data begins at time t=0
# value - (dict of strings) - either an offer id or transaction amount depending on the record
# amount - transaction amount
# reward - offer (BOGO/Discount) reward amount

cleaned transcript

By completing above steps, we have had a basic understanding of original datasets. Therefore, in the next step, we will be working on combining these datasets into a fully integrated one. It can be very useful and straightforward for data analysis/visualization.


First, we want to extract pairs of person_id & offer_id for only offer_completed == 1 from the transcript dataset. Doing so will make the algorithm more efficient to run.

# extract data only completed the offer
df = transcript[transcript['offer completed']==1]
df = df.drop_duplicates()
# create pairs of person_id & offer_id for only offer_completed offers
loop = list(zip(df.person, df.offer_id))

Second, we need to make sure the offer_viewed column is representing viewed before completed. Otherwise, we are counting the not viewed offers into viewed offers that would influence customers.

For all viewed events happens after completed, we need to correct viewed into not viewed. It means the customer received the offer, but it’s not viewed hence doesn’t have an influence on customer’s purchasing behaviours.

# below step we modify the viewed column so the viewed == 1 only when it's actually viewed before the offer completes.
# it helps us better measure the offer influence! It takes about 140 mins to process so remember saving it as a local file once finished.

for person_id, offer_id in tqdm(loop):

if transcript[(transcript.person == person_id) & ( == offer_id) & (transcript['offer completed'] == 1)].time.values[0] < transcript[(transcript.person == person_id) & ( == offer_id) & (transcript['offer viewed'] == 1)].time.values[0]:
transcript.loc[(transcript['person'] == person_id) & (transcript['id'] == offer_id), ['offer viewed']]= 0
except IndexError:

Now, we should have a transcript dataset containing events of corrected values.

For the next step, I wanted to have the transcript dataset containing only unique pair of person_id & offer_id. Followed by a sum, it should return a dataset telling us each person’s received offers and how does the person respond to each offer.

# group by pair of ids so we made sure no duplicates
compiled = transcript.groupby(['person','id']).sum().reset_index()
# transactions dropped
compiled = compiled.drop(['time','amount','reward','transaction'],1)
compiled = compiled[ != 0]

Since we performed the sum operation, we want to correct event columns into values either 0 or 1.

# for all values in these columns, make sure either 0 or 1
df = compiled[['offer completed', 'offer received', 'offer viewed']]
df = (df > 0) * 1compiled['offer completed'] = df['offer completed']
compiled['offer received'] = df['offer received']
compiled['offer viewed'] = df['offer viewed']
compiled dataset

Final step in implementation is to build the fully integrated dataset for data visualization. By merging complied dataset with profile & portfolio, we should have the full dataset ‘df’ containing all valuable features.


If you are interested in exploring how difference_days (membership length) affects customer’s purchasing behaviours, you could further categorize the membership length as following

# categorize the membership length 
def label_race (row):
if row.difference_days > 1499 & row.difference_days < 2000:
return 'very long'
if row.difference_days > 999 & row.difference_days < 1500:
return 'long'
if row.difference_days > 499 & row.difference_days < 1000:
return 'medium'
if row.difference_days < 500:
return 'short'
return 'other'
df['membership'] = df.apply(lambda row: label_race(row), axis=1)df = df.drop('difference_days',1)

Congratulations on completing the implementation steps! It’s a pretty time consuming job.

Next we will continue on the most important and insightful part of this project!

Data Visualization

Based on the full dataset df, we can use SQL queries in python to create some new insightful & straightforward datasets.

query = """
ROUND(((1.0*total_completed_without_view) / (1.0*total_completed))*100, 2) as total_completed_without_view_ratio,
100 - ROUND(((1.0*total_completed_without_view) / (1.0*total_completed))*100, 2) as total_completed_with_view_ratio,
`loss ($)`
COUNT(*) AS total_completed
completed = 1
offer_id) complete

COUNT(*) AS total_completed_without_view,
SUM(reward) AS `loss ($)`
viewed = 0
AND completed = 1
offer_id) complete_not_view ON complete.offer_id = complete_not_view.offer_id

total_completed_without_view_ratio DESC

completed_without_view = pdsql.sqldf(query)

Above data frame tells us how each unique offer (discount or BOGO) performs. Since discount and BOGO offers all provide amount of rewards to customer, if the offers are not viewed and completed, it can actually cause a loss for the company. So we sum the total loss for each offer and calculated each ratio of with/without viewed before completed for the company as a future improvement reference.

Zoom into each offer type loss
Loss plot for each offer Id

Generally, we see that BOGO offers create much more loss than the Discount offers. This makes sense since usually the rewards of BOGO is greater than the Discount’s. (For instance, BOGO rewards a cup of coffee and Discount rewards 30% off only).

Below, we will continue on exploring insights about how offer’s difficulty, duration and customer’s membership length affects offer performance.

Offer performance for each offer difficulty

There is no BOGO offer with 20 dollars difficulty. The discount offer (20 dollars) has the largest ratio that complete without views. This means that 20 dollars discounts could cost the company a lot however people actually are not influenced by this certain type of offer. One recommendation would be to reduce the discount offers with 20 dollars difficulty. When we further analyze the customer transaction pattern, we will find that the median of transaction amounts is usually far below 20 dollars.

Further, we can have duration & membership length analysis as following:

Offer performance for each offer duration

10 days duration BOGO offer has a much larger complete_without_view_ratio than 7 days BOGO offer. However, the duration doesn’t cause a big difference for Discount offer. Does it mean that people don’t want to miss BOGO offer chances? Customers are sensitive to the BOGO offer valid duration rather than Discount offer’s. This can be a great strategy to use short-time BOGO offer if we want more customers to involve in some special events.

Offer performance for each membership length

This graph doesn’t tell us too many insights. BOGO offer may have less influence on short membership (less than 500 days) people but discount offer may have a greater influence on this group. If the difference isn’t too obvious, it also tells us that length of membership is not a significant feature when performing customer segmentation.

Above, we complete analysis on features that are easy to categorize. However, for more features like age and income, we can plot them differently.

We will continue on our analysis for features like age, income and gender.

spending_distribution = transcript[transcript['event'] == 'transaction'].merge(profile, left_on='person', right_on='id')

plt.hist(spending_distribution[(15 < spending_distribution.age) & (spending_distribution.age < 25)]['amount'], range=(0, 40), alpha=0.5, bins=40, label='15~25')
plt.hist(spending_distribution[(25 < spending_distribution.age) & (spending_distribution.age < 35)]['amount'], range=(0, 40), alpha=0.5, bins=40, label='25~35')

plt.legend(loc='upper right')

plt.title('Spending per Transaction Distribution')
plt.xlabel('Amount ($)')

Similar transaction patterns for people in age groups 15~25 and 25~35. Medians are between 5 and 10 dollars.

A different pattern for people in age group 35~45 & 45~55, as we can obviously see the medians are shifted to the right. It means that people in these two age groups have more financial ability to purchase.

plt.hist(spending_distribution[(55000 < spending_distribution.income) & (spending_distribution.income < 65000)]['amount'], range=(0, 40), alpha=0.5, bins=40, label='55000~65000')
plt.hist(spending_distribution[(65000 < spending_distribution.income) & (spending_distribution.income < 75000)]['amount'], range=(0, 40), alpha=0.5, bins=40, label='65000~75000')

plt.legend(loc='upper right')

plt.title('Spending per Transaction Distribution')
plt.xlabel('Amount ($)')

People from income group blue and orange have different purchase patterns. Orange group has a larger median than the blue group’s.

Two groups have similar patterns, however, higher income group has more financial ability to purchase.

As the income increases, people don’t stay on the track of purchasing more any more. In fact people with higher income tend to purchase less often with higher spending each time. If we are considering the most active income group, we still should focus on the income group 55000~65000 because they are the most active group.

spending_distribution = transcript[transcript['event'] == 'transaction'].merge(profile, left_on='person', right_on='id')

plt.hist(spending_distribution[spending_distribution['gender'] == 'M']['amount'], range=(0, 40), alpha=0.5, bins=40, label='Male')
plt.hist(spending_distribution[spending_distribution['gender'] == 'F']['amount'], range=(0, 40), alpha=0.5, bins=40, label='Female')
plt.legend(loc='upper right')

plt.title('Spending per Transaction Distribution')
plt.xlabel('Amount ($)')

Great difference between male and female purchase patterns. It seems that females are purchasing more randomly and males are tending to buy one or two items only per purchase. In this way, males tend to have a purchasing habit and are more predictable.


We have analyzed as many as possible features eventually to serve for the goal of locating the target group of customers to improve our current offer segmentation.

This project revealed some hidden insights from the datasets by using skills of data visualization, numerical analysis and data manipulation in SQL. We are able to zoom into the details and find out how much each offer costs the company and which type of offers should be increased/decreased. Besides, we are also able to achieve customer segmentation analysis by evaluating demographics data including customer’s: income, age, gender and membership length.

Combining the results, we grow our confidence in giving several strategy improvements in the following Reflection Section.


From all analysis above, we have an impression that income group 55000~65000 is the most active group. Females purchase more randomly compared to males. Males are tending to purchase one or two items only per purchase. Age group below 35 has less financial ability compared to people above 35. Membership length doesn’t seem to be an important factor evaluating offer influence.

To reduce the loss for sending offers actually not effective, we should reduce offers of 10 days duration & 20 dollars difficulty. Because a large part of customers are not influenced by this type of offers. It means that high duration & difficulty offers tend to create more loss for the company. It makes sense!

Overall, BOGO offers create loss much more than the discount offers do.

Therefore, to achieve maximum benefits of sending offers, we need to categorize our offers based on difficulty levels. For offers of high difficulty, I would recommend sending these to income group above 65000, age above 35, Females. For offers more achievable, I would recommend sending these to income group below 65000, age below 35, Males. The first group doesn’t have a tendency purchasing at a lower price zone but they are less active. The second group has a tendency purchasing at a lower price but are more active!


Further improvements including using unsupervised Machine Learning model directly on the full dataset, in that way, we may have more insightful hidden patterns revealed to us. However, the method of ML model does require more work on the feature engineering steps to reasonably select useful columns. For the implementation step, the method of ML should be faster & easier than manually creating comparative graphs. Also, by using unsupervised learning model, we are giving machine enough freedom to explore the data unlimited by human imagination. Hence I believe it will definitely be valuable to try to use unsupervised learning models on this dataset.

Finally, hope you enjoyed this long reading!


Data Scientist 👨‍💻 | Article Writer 🍡 | LinkedIn:

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store