Automated customer segmentation (with Python and UbiOps)

10 November 2021 | Blog, Technology, UbiOps

Introduction

Customer segmentation is a great way of improving your understanding of your customers so that you can react better to their needs. To really get the most out of customer segmentation, it is good practice to automate it so you minimize manual work and time spent. In this article, I will show you how you can automate your customer segmentation with the use of Python and UbiOps. I will perform an RFM analysis on transaction data that is stored in a Google Sheet.

You can find the resulting code from this tutorial here.

 

What is RFM analysis?

RFM analysis is a basic customer segmentation algorithm. It stands for Recency, Frequency, and Monetary analysis. This particular analysis assigns each customer to a group based on three factors:

  • How recently did the customer buy something?
  • How often do they buy something?
  • How much money do they spend, when they buy something?

Using these three factors, the algorithm will assign an RFM score to each customer in the following way:

  1. Divide the customers into quartiles based on recency
  2. Divide the customers into quartiles based on frequency
  3. Divide the customers into quartiles based on spending
  4. Concatenate the quartile numbers of each customer to arrive at a score

Take example customer Annie: she might be in the first quartile for recency, the second for frequency, and the third for spending. Her RFM score would then be 123. A score of 111 is the best achievable score, and a score of 444 the worst.
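The scoring step boils down to simple string concatenation. A minimal sketch for the Annie example (the quartile values here are just the hypothetical ones from the text):

```python
# Hypothetical quartile positions for one customer ("Annie")
r_quartile, f_quartile, m_quartile = 1, 2, 3

# The RFM score is the three quartile numbers concatenated as a string
rfm_score = f"{r_quartile}{f_quartile}{m_quartile}"
print(rfm_score)  # → 123
```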

Want to dig deeper into RFM? I recommend this article by Leif Arne Bakker.

 

How to perform RFM analysis on your data

RFM analysis needs transaction data. I will use this Kaggle dataset as an example to show you how to perform an RFM analysis. To start, we will need to import pandas and read in the data.

import pandas as pd
data_df = pd.read_excel('Online_Retail.xlsx')

If you want to have a peek at the data you can use `data_df.head()` and you should see something like this:

 

Now that we have our data, we can start the RFM analysis. To keep things succinct I won’t cover data cleaning in this blog post, but be aware that you will need to clean your data before applying the following code to your own dataset.
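As a rough indication of what such a cleaning step could look like (the exact rules depend on your data; the column names below follow the Kaggle dataset), you might drop rows without a customer ID and rows with non-positive quantities, such as cancellations:

```python
import pandas as pd

# Tiny illustrative frame; in practice this would be your data_df
data_df = pd.DataFrame({
    'CustomerID': [17850.0, None, 13047.0],
    'Quantity': [6, 2, -1],
    'UnitPrice': [2.55, 3.39, 1.25],
})

# Drop rows without a customer ID, then rows with non-positive quantities
clean_df = data_df.dropna(subset=['CustomerID'])
clean_df = clean_df[clean_df['Quantity'] > 0]
print(len(clean_df))  # → 1
```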

The first thing we need to do is calculate the total price of each transaction. To do that we can simply multiply the UnitPrice column with the Quantity column and save the outcome in a new column, which we will call TotalPrice. You can achieve this with the following line of code:

data_df['TotalPrice'] = data_df['Quantity'].astype(int) * data_df['UnitPrice'].astype(float)

Now that we have our necessary columns ready, we can perform the following calculations for our recency, frequency and monetary values:

  1. Recency: Calculate the number of days between the most recent date in the dataset and the last purchase date of each customer.
  2. Frequency: Calculate the number of orders for each customer.
  3. Monetary: Calculate the sum of all the money spent for each customer.

We can perform these calculations quickly with a groupby and a few lambda functions:

data_df['InvoiceDate'] = pd.to_datetime(data_df['InvoiceDate'])

# Reference date: the most recent invoice date in the dataset
snapshot_date = data_df['InvoiceDate'].max()

rfm = data_df.groupby('CustomerID').agg({
    'InvoiceDate': lambda date: (snapshot_date - date.max()).days,
    'InvoiceNo': lambda num: len(num),
    'TotalPrice': lambda price: price.sum()})

Perfect! Now that we have our values, the last thing we need to do to complete the analysis is divide the customers into quartiles for recency, frequency, and monetary value, and concatenate the quartile numbers.

# Rename the columns
rfm.columns = ['recency', 'frequency', 'monetary']

# Compute quartiles of the RFM values
rfm['recency'] = rfm['recency'].astype(int)
rfm['r_quartile'] = pd.qcut(rfm['recency'].rank(method='first'), 4, ['1', '2', '3', '4']).astype(int)
rfm['f_quartile'] = pd.qcut(rfm['frequency'], 4, ['4', '3', '2', '1']).astype(int)
rfm['m_quartile'] = pd.qcut(rfm['monetary'], 4, ['4', '3', '2', '1']).astype(int)

# Concatenate the quartile numbers to get the RFM score
rfm['RFM_Score'] = rfm.r_quartile.astype(str) + rfm.f_quartile.astype(str) + rfm.m_quartile.astype(str)

After you’ve run this code snippet and inspect `rfm` with `rfm.head()`, the resulting dataframe should look like this:

 

Every customer now has an RFM score assigned to them, to indicate what type of customer they are. You can use this information to inform your marketing and sales strategies.
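How you act on the scores is up to you. As an illustration (the segment names and rules below are my own, not part of RFM itself), you could map the scores to human-readable labels for your marketing team:

```python
def label_segment(score: str) -> str:
    """Map an RFM score to a coarse, hypothetical segment name."""
    if score == '111':
        return 'champion'          # best on all three dimensions
    if score == '444':
        return 'lost'              # worst on all three dimensions
    if score.startswith('1'):
        return 'recent customer'   # top quartile for recency
    if score.endswith('1'):
        return 'big spender'       # top quartile for spending
    return 'regular'

print(label_segment('111'))  # → champion
print(label_segment('431'))  # → big spender
```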

 

Automating your RFM analysis

To get the most out of the RFM analysis, it should be automated and performed periodically on your data. To get started with the automation, you first need to set up a connection between the analysis and the place where your data is stored. Right now we use a local Excel file, but it is more likely that you have this data in some kind of shared place, be it a database or something simpler like a shared drive. In this article I’ll show you how to automate the analysis when your transaction data is stored in a Google Sheet. Your situation might be different of course, but the same principles will still apply.

 

Setting up a service account for your Google Sheet

To set up a connection to your Google Sheet, we will need a service account that has access to the sheet. You can create one as follows:

  1. Head to the Google Developers Console and create a new project (or select an existing one).
  2. You will be redirected to the project dashboard. There, click “Enable APIs and services” and search for “Sheets API”.
  3. In the API screen, click “ENABLE” to enable this API.
  4. Similarly, enable the “Drive API”.
  5. Go to the “Credentials” tab and choose “Create Credentials > Service Account” to create a new service account.
  6. Give the service account a name and a description.
  7. Set the service account permissions to “Compute Engine Service Agent”, skip the third step in the form, and click create.
  8. Navigate to the newly created service account and go to the “Keys” tab. Click “Add Key > Create new Key”.
  9. Set the type to JSON and click create. This will prompt a download of a JSON file containing the private key needed for account authorization. Store it in the folder where you are running your code.

Okay, we now have a service account with the right permissions. The only thing left is to give the service account access to the sheet containing the data. To do this, head over to the Google Sheet with your data and click “Share”. Share the Google Sheet with the email address of the service account you created in the previous steps. The service account needs editor rights, as it will perform both read and write actions.

 

Retrieving data from the Google Sheet

With the service account you created earlier, we can now establish a connection from your Python code to the Google Sheet. I am using the library pygsheets to help with this.

import json
import pygsheets
from google.oauth2 import service_account

SCOPES = ('https://www.googleapis.com/auth/spreadsheets', 'https://www.googleapis.com/auth/drive')

# Change the filename below to that of your own JSON key file
with open('your_json_file.json') as key_file:
    service_account_info = json.load(key_file)
my_credentials = service_account.Credentials.from_service_account_info(service_account_info, scopes=SCOPES)
gc = pygsheets.authorize(custom_credentials=my_credentials)

A connection has now been established, and we can continue to read in the data from the Google Sheet:

spreadsheet = gc.open('Online_Retail')
sheet_data = spreadsheet[0]

# Transform the data into a pandas DataFrame
data_df = sheet_data.get_as_df()

And now we have a dataframe based on the data in the Google Sheet, which we can use for the analysis from the previous section. Just take out the `data_df = pd.read_excel('Online_Retail.xlsx')` line.

Writing the results to the Google Sheet

Now that we can retrieve the data to perform the analysis, we should also make sure we write the results of our analysis to a place where people can access them. In this case, we will write the results to a separate tab of the same Google Sheet. You can achieve that by adding the following lines after your RFM analysis code:

# Sort the customers by RFM score (best scores first)
sorted_customers = rfm.sort_values('RFM_Score', ascending=True)

sheet_title = 'RFM results'
try:
    sh = spreadsheet.worksheet_by_title(sheet_title)
except Exception:
    print('Worksheet does not exist, adding new sheet')
    spreadsheet.add_worksheet(sheet_title)
    sh = spreadsheet.worksheet_by_title(sheet_title)
finally:
    sh.set_dataframe(sorted_customers, 'A1', copy_index=True)
    sh.update_value('A1', 'CustomerID')
    print('Data inserted successfully')

After running the resulting code from top to bottom (first reading in the data, then performing the analysis, then writing away the results), you should see your results appear in the Google Sheet!

 

Moving your analysis to the cloud

So we now have a piece of Python code that retrieves data from a Google Sheet, performs an analysis on it, and writes the results to the same sheet. In its current condition, the code would need to be run locally from a laptop every time you need the newest customer segmentation results. If we move the entire code to the cloud, we can make the analysis run periodically and we can open it up to more people within the company. To achieve this I will be using UbiOps, a platform that allows you to serve Python or R code in a scalable way behind an API. I’ll show you how I use it.

 

Preparing the code for UbiOps

Before I can put my code in a deployment in UbiOps, I need to wrap it in a so-called deployment package. This means putting the code inside a Deployment class, which tells UbiOps what to run and when. The deployment.py file for our use case looks like this:

import os
import json

from google.oauth2 import service_account
import pygsheets
import pandas as pd


class Deployment:

    def __init__(self, base_directory, context):
        print('Initialising the connection to the google drive')
        self.gc = None

        SCOPES = ('https://www.googleapis.com/auth/spreadsheets', 'https://www.googleapis.com/auth/drive')
        service_account_info = json.loads(os.environ['credentials'])
        my_credentials = service_account.Credentials.from_service_account_info(service_account_info, scopes=SCOPES)

        try:
            self.gc = pygsheets.authorize(custom_credentials=my_credentials)
            print('Established successful connection')
        except Exception as e:
            print('Connection failed, ', e.__class__, 'occurred.')

    def request(self, data):
        print('Getting the requested file')
        spreadsheet = self.gc.open(data['filename'])
        sheet_data = spreadsheet[0]

        # Transform the data into a pandas DataFrame
        data_df = sheet_data.get_as_df()

        # RFM analysis
        print('Performing RFM analysis')
        data_df['TotalPrice'] = data_df['Quantity'].astype(int) * data_df['UnitPrice'].astype(float)
        data_df['InvoiceDate'] = pd.to_datetime(data_df['InvoiceDate'])

        # Reference date: the most recent invoice date in the dataset
        snapshot_date = data_df['InvoiceDate'].max()
        rfm = data_df.groupby('CustomerID').agg({
            'InvoiceDate': lambda date: (snapshot_date - date.max()).days,
            'InvoiceNo': lambda num: len(num),
            'TotalPrice': lambda price: price.sum()})

        # Rename the columns
        rfm.columns = ['recency', 'frequency', 'monetary']

        # Compute quartiles of the RFM values
        rfm['recency'] = rfm['recency'].astype(int)
        rfm['r_quartile'] = pd.qcut(rfm['recency'].rank(method='first'), 4, ['1', '2', '3', '4']).astype(int)
        rfm['f_quartile'] = pd.qcut(rfm['frequency'], 4, ['4', '3', '2', '1']).astype(int)
        rfm['m_quartile'] = pd.qcut(rfm['monetary'], 4, ['4', '3', '2', '1']).astype(int)

        # Concatenate the quartile numbers to get the RFM score
        rfm['RFM_Score'] = rfm.r_quartile.astype(str) + rfm.f_quartile.astype(str) + rfm.m_quartile.astype(str)

        # Sort the outcomes
        print('Sorting customers')
        sorted_customers = rfm.sort_values('RFM_Score', ascending=True)

        # Insert the results into the Google Sheet in a separate tab
        print('Inserting data into the google sheet')
        sheet_title = 'RFM_results'
        try:
            sh = spreadsheet.worksheet_by_title(sheet_title)
        except Exception:
            print('Worksheet does not exist, adding new sheet')
            spreadsheet.add_worksheet(sheet_title)
            sh = spreadsheet.worksheet_by_title(sheet_title)
        finally:
            sh.set_dataframe(sorted_customers, 'A1', copy_index=True)
            sh.update_value('A1', 'CustomerID')
            print('Data inserted successfully')

        return None

Next to the deployment file, we need to make a requirements.txt file for informing UbiOps about the dependencies of our code. The following requirements.txt should suffice:

google-api-core==1.28.0
google-api-python-client==2.5.0
google-auth==1.30.0
google-auth-httplib2==0.1.0
google-auth-oauthlib==0.4.4
googleapis-common-protos==1.53.0
numpy==1.20.3
oauthlib==3.1.0
pygsheets==2.0.5
pandas==1.2.4

Once you have these two files, put them in a separate folder and zip the entire folder. That zip is your deployment package.
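If you prefer to script this step too, the zip can be created from Python with the standard library. In this sketch the folder is a temporary stand-in and `deployment_package` is a placeholder name; in practice you would point `make_archive` at the folder holding your real deployment.py and requirements.txt:

```python
import os
import shutil
import tempfile

# Stand-in deployment folder with empty placeholder files
folder = tempfile.mkdtemp()
for name in ('deployment.py', 'requirements.txt'):
    open(os.path.join(folder, name), 'w').close()

# Zip the folder's contents into deployment_package.zip
zip_path = shutil.make_archive('deployment_package', 'zip', folder)
print(zip_path)
```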

Creating the deployment in UbiOps

You can create a deployment via Python, via the CLI or via the WebApp, whichever has your preference. I will use the WebApp.

  1. Head over to https://app.ubiops.com and sign in with your account, or create a free one if you don’t have one.
  2. Go to your project and navigate to the “Deployments” section, then click Create.
  3. Give your deployment a name, e.g. `automated-rfm`.
  4. Under Input, define an input field named “filename” of type string.
  5. Leave the other settings as they are and click Next: Create a version.

Now that you have a deployment, we can create the version:

  1. At Deployment package, click Upload zip and upload the deployment package zip you created earlier.
  2. Expand the Optional/Advanced settings and scroll down to Deployment environment variables.
  3. Click Create variable, fill in “credentials” as the name, and copy-paste the JSON key of your Google service account as the value. Mark the environment variable as a secret and hit the checkmark button.
  4. You can leave all the other settings on their defaults and click Create.

Your deployment will now start building. In the background UbiOps will make a Docker image for your code and wrap an API around it. Once it’s done you will see that the status will change to available.

You can test if everything works by clicking the Create request button in the top right once your deployment is available. Use the name of your Google Sheet as input for the request.

 

Scheduling your analysis

The last step in automating your analysis is to make it run periodically to keep all the RFM scores up to date. You can do that by creating a request schedule in UbiOps that periodically triggers your deployment.

Head over to “Request schedules” in the WebApp and click Create. Give your schedule a name, like “monthly-rfm”, and select your deployment from the dropdown menu. Click Next to continue to the next part of the creation form. Under input data, fill in the name of your Google Sheet and click Next step. Now you will be prompted to define a schedule. Let’s say you need the analysis to run on the first day of every month at 8:00; you can then set the following cron expression: 0 8 1 * *

After you’ve set the cron job expression, click Create and your schedule will be turned on!
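As a side note, a cron expression consists of five space-separated fields (minute, hour, day of month, month, day of week). Splitting the expression used above makes the schedule explicit:

```python
# Break the cron expression "0 8 1 * *" into its five fields
expression = '0 8 1 * *'
fields = dict(zip(('minute', 'hour', 'day_of_month', 'month', 'day_of_week'),
                  expression.split()))
print(fields)
# → minute 0, hour 8, on day-of-month 1, every month, any day of the week
```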

If you’ve followed along, you now have an RFM analysis that runs automatically on the first of every month. This gives you a continuously updated customer segmentation which your team can use to inform their decisions. The example in this article is specific to RFM analysis, but the same principles apply to other types of customer segmentation as well.

Do you have a need for automating your analytics? Feel free to contact us!

—————————————————————————————————————————————-

 

More information

Want to know more about the libraries and tools used in this article? Have a look at these articles:

Pygsheets:

How to automate Google Sheets with Python – by Dayal Chand Aichara

RFM:

Know your customers with RFM – Leif Arne Bakker

UbiOps:

How to build and implement a recommendation system from scratch (in Python)