Category: General

  • BG/BB CLTV Modeling for Charities

    Do you run a professional conference that occurs periodically? Are you an owner of a cruise line or a blood drive? Perhaps you own a church. If any of these apply to you, then you’re running a business that operates in a noncontractual, discrete-time context. Noncontractual, discrete-time contexts are business settings where customer transactions can only occur at fixed intervals and where a customer can terminate their relationship with the business at any time. If you wanted to model customer lifetime value for these types of businesses, the Pareto/NBD and BG/NBD would not cut it; both models work only in noncontractual, continuous-time settings. What you can use instead is the Beta Geometric/Beta Bernoulli model, or BG/BB for short. In this post I’ll dive into the nuts and bolts of the model and show you how a charity can use it to predict donations.

    BG/BB Modeling Assumptions

    There are several assumptions that underlie the use of the BG/BB model. As done before, I’ll go through each assumption and explain what it means.

    Assumption #1: A customer can be “alive” for some unobserved period of time, and become permanently inactive, aka “die”.

    “Alive” in this context means that the customer is making transactions with the business. Given that this is a noncontractual business setting, we are unable to determine whether the customer has ended their relationship with the business or is just taking a long-term hiatus. It’s therefore assumed that the customer becomes inactive after some unknown period of time.

    Assumption #2: The number of transactions that a customer will make while alive follows a binomial distribution.

    Let’s look at this assumption in the context of a charity that requests donations every month from individuals who have made a contribution in the past. Each time the charity reaches out, some individuals may choose to make a donation and some may choose not to. Given an individual and a month, we represent the chance that he or she will make a donation using a probability. We’ll call this probability p.

    We can use the binomial distribution to model the probability that an individual will make a number of donations over the course of several months.

    Binomial distribution: $P(X = x) = \binom{n}{x} p^{x} q^{n - x}$

    The number of months and number of donations are represented by n and x respectively. The probability of no donation during a single month is represented by q. This is the complement of p, the probability that the individual will make a donation, so q = 1 − p.
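
    If you’d like to play with the binomial formula yourself, here is a minimal sketch in plain Python. The function name binomial_pmf and the example values (a 30% monthly donation probability over 12 months) are made up for illustration; they don’t come from the donations dataset.

```python
from math import comb


def binomial_pmf(x, n, p):
    """Probability of exactly x donations in n monthly asks, each made with probability p."""
    q = 1.0 - p  # probability of no donation in a single month
    return comb(n, x) * p**x * q ** (n - x)


# Hypothetical donor: 30% chance of donating in any month, 12 months of asks.
for x in range(4):
    print(x, round(binomial_pmf(x, n=12, p=0.3), 4))
```

    Summing binomial_pmf over every x from 0 to n gives 1, which is a quick sanity check that the probabilities cover every possible outcome.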

    Assumption #3: The unobserved lifetime of a customer can be described using a geometric distribution.

    Going back to the charity scenario from the previous section, we assume that each month a donor can “drop out” with some probability. We’ll represent this probability using ϴ. If we were curious about the probability that a donor will drop out after x months, we can use the geometric distribution to find out:

    Geometric distribution

    If we were to plot the geometric distribution using some value for ϴ and a range of values for x, we would see that the probability decreases as x increases. To drop out in month x, a donor first has to stay active through every month before it, which becomes less and less likely the further out we go.
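
    Here is a small sketch of that geometric calculation. I’m assuming months are counted from 1 and that “dropping out in month x” means the donor stayed active for the x − 1 months before it; other write-ups shift the index by one. The name dropout_pmf and the value of theta are illustrative only.

```python
def dropout_pmf(x, theta):
    """Probability that a donor drops out in month x (x = 1, 2, ...),
    having stayed active through the x - 1 months before it."""
    return theta * (1.0 - theta) ** (x - 1)


theta = 0.2  # hypothetical monthly dropout probability
probs = [dropout_pmf(x, theta) for x in range(1, 7)]
print([round(p, 4) for p in probs])
```

    Each value in the printed list is smaller than the one before it, which is the downward slope you would see in the plot.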

    Assumption #4: Heterogeneity in the transaction probability for each customer follows a beta distribution.

    Each donor will have a different likelihood of making a contribution each month. Some donors will contribute at every opportunity while others may choose to donate only occasionally. The beta distribution can be used to model how the donation probabilities vary across donors.

    Assumption #5: Heterogeneity in the dropout probability for each customer follows a beta distribution.

    We can also reasonably assume that the dropout probability will vary from donor to donor. Once again, the beta distribution is used to model this variation.

    Assumption #6: The transaction probability and the dropout probability vary independently across customers.

    What this means is that there’s no relationship between the transaction probability and the dropout probability across the customer base. Knowing how likely a donor is to give in any month tells us nothing about how likely that donor is to drop out.

    The Model

    The second and fourth assumptions outlined in the previous section result in the Beta Bernoulli model. Similarly, the third and fifth assumptions lead to the beta-geometric model. Together, they form the Beta Geometric/Beta Bernoulli model. For insights on the derivation of the model as well as the functions used to perform various predictions, you can read the paper here.
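
    To get a feel for how the assumptions fit together, here is a rough simulation of a single donor using only Python’s standard library. It is a sketch of the story told by the assumptions, not the fitting code used by lifetimes. The parameter names alpha, beta, gamma and delta, their values, and the choice to check for dropout before each donation opportunity are all assumptions on my part.

```python
import random


def simulate_donor(n_periods, alpha, beta, gamma, delta, rng):
    """Simulate one donor's yes/no donation history under the BG/BB assumptions."""
    p = rng.betavariate(alpha, beta)       # this donor's donation probability
    theta = rng.betavariate(gamma, delta)  # this donor's dropout probability
    history = []
    alive = True
    for _ in range(n_periods):
        # At each opportunity an alive donor may drop out for good.
        if alive and rng.random() < theta:
            alive = False
        # An alive donor donates with probability p; an inactive one never does.
        history.append(1 if alive and rng.random() < p else 0)
    return history


rng = random.Random(42)  # fixed seed so the run is repeatable
for _ in range(3):
    print(simulate_donor(6, alpha=1.2, beta=0.75, gamma=0.66, delta=2.78, rng=rng))
```

    Each printed list is one donor’s history across six donation opportunities, with 1 marking a donation and 0 marking none.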

    BG/BB In Action

    As done with the other models discussed thus far, I will show you a demonstration of BG/BB. I will be using a dataset included in the lifetimes Python package that contains recency and frequency data of individuals making donations to a major nonprofit organization located in the United States.

    We’ll begin by loading the donations dataset.

    import pandas as pd
    import lifetimes
    import lifetimes.datasets as ld
    
    donations_df = ld.load_donations()
    Donations Dataset

    The dataset provided by lifetimes provides all the information we need to utilize the BG/BB model.

    • The frequency column in the dataset records the number of observed periods where a transaction was made.
    • The recency column shows the time period in which the most recent transaction was made.
    • The periods column indicates the number of transaction opportunities that were provided.
    • The weights column indicates the number of customers with a given frequency, recency, and periods combination.
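
    If your own data is a yes/no history per donor rather than a summary table, the columns above can be derived along these lines. This is a sketch under my own assumptions: summarize_history is a hypothetical helper, recency is taken to be the 1-based position of the last donation (0 if there was none), and the three histories are invented.

```python
from collections import Counter


def summarize_history(history):
    """Collapse a 0/1 donation history into (frequency, recency, periods)."""
    periods = len(history)    # number of transaction opportunities
    frequency = sum(history)  # opportunities at which a donation was made
    # Recency: 1-based position of the last donation, 0 if there was none.
    recency = max((i + 1 for i, made in enumerate(history) if made), default=0)
    return frequency, recency, periods


print(summarize_history([1, 0, 1, 1, 0, 0]))  # (3, 4, 6)

# Donors with identical summaries collapse into one row; the count is the weight.
histories = [[1, 0, 1, 1, 0, 0], [1, 1, 0, 1, 0, 0], [0, 0, 0, 0, 0, 0]]
weights = Counter(summarize_history(h) for h in histories)
print(weights[(3, 4, 6)])  # 2
```

    The first two histories differ month by month but share the same frequency, recency and periods, so they end up in a single row with a weight of 2.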

    We will fit a BG/BB model using the above information.

    bgbb_mdl = lifetimes.BetaGeoBetaBinomFitter()
    bgbb_mdl.fit(frequency=donations_df['frequency'],
                 recency=donations_df['recency'],
                 n_periods=donations_df['periods'],
                 weights=donations_df['weights'])

    Let’s say we were interested in identifying customers that will still be active 3 months into the future. We can use the conditional_probability_alive method to do this:

    donations_df['is_alive_3_months'] = bgbb_mdl.conditional_probability_alive(m_periods_in_future=3,
                                           frequency=donations_df['frequency'],
                                           recency=donations_df['recency'],
                                           n_periods=donations_df['periods'])
    BG/BB 3 Month Customer Alive Probabilities

    Assuming we are only interested in customers with an alive probability of 0.7 or greater, we can see that roughly one third of the customer base will still be active after 3 months.
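
    One thing to keep in mind when computing a share like this: each row of the dataset stands for as many customers as its weights value says, so the share should be weighted rather than a plain row count. Below is a minimal sketch of that calculation. The helper weighted_share_alive and the numbers are hypothetical stand-ins for the is_alive_3_months and weights columns.

```python
def weighted_share_alive(probabilities, weights, threshold=0.7):
    """Share of customers whose alive probability meets the threshold,
    counting each row as many times as its weight."""
    total = sum(weights)
    above = sum(w for p, w in zip(probabilities, weights) if p >= threshold)
    return above / total


# Hypothetical values standing in for the is_alive_3_months and weights columns.
probs = [0.05, 0.45, 0.72, 0.91]
weights = [3000, 500, 1000, 500]
print(weighted_share_alive(probs, weights))  # 0.3
```

    With a dataframe you could pass donations_df['is_alive_3_months'] and donations_df['weights'] straight into the same function.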

    We can also estimate the number of transactions each customer will make in 3 months time using the conditional_expected_number_of_purchases_up_to_time method:

    donations_df['num_tx_3_months'] = bgbb_mdl.conditional_expected_number_of_purchases_up_to_time(m_periods_in_future=3,
                                                                                                   frequency=donations_df['frequency'],
                                                                                                   recency=donations_df['recency'],
                                                                                                   n_periods=donations_df['periods'])
    BG/BB 3 Month Predicted Transactions

    Looking at the same group of active customers, we can expect to receive at least 1 donation from these customers within the next 3 months.

    That’s all folks!

    You can find the complete code discussed in this post here.

  • Gamma Gamma Bills Y’all!

    If you’ve been following my series of posts on customer lifetime value prediction, you should be well aware of Pareto/NBD and BG/NBD. Both models can be used to estimate the lifetime of any given customer and predict the number of transactions the customer will make while alive. It would be nice if we could also get an estimate of the average spend per transaction of each of our customers. The next model I will be going over will do just that. It is called Gamma Gamma.

    Gamma Gamma Modeling Assumptions

    Much like Pareto/NBD and BG/NBD, Gamma Gamma models come with their own set of assumptions.

    • The first assumption is that the monetary value of any given customer transaction varies randomly around the mean transaction value for the customer. We’ll go into more detail about this point later.
    • The second assumption is that the average spend varies across customers, but remains consistent over time for any individual customer.
    • The third assumption is that the distribution of average spend across customers is independent of the transaction process.

    Going back to the first assumption, the value of each transaction is assumed to be distributed according to a gamma distribution. The gamma distribution formula again is:

    Gamma distribution probability density function: $f(x; \alpha, \beta) = \frac{\beta^{\alpha} x^{\alpha - 1} e^{-\beta x}}{\Gamma(\alpha)}$

    Where α and β describe the shape and rate of the distribution. The rate parameter is not fixed either: it is itself assumed to follow a gamma distribution across customers. The Gamma Gamma model gets its name from this use of two gamma distributions.

    Given the assumptions we can calculate the average monetary value for any customer using the following equation:

    Gamma Gamma expected average profit calculation
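
    For readers who want to see the shape of that calculation in code, here is a sketch based on my understanding of the Gamma Gamma result: the estimate is a weighted average of the customer’s own observed average spend and the population mean, with more weight on the customer’s own history as their number of transactions grows. The parameter names p, q and v and their values are assumptions for illustration; in practice they come out of the model fit.

```python
def expected_average_spend(frequency, avg_spend, p, q, v):
    """Blend a customer's observed average spend with the population mean.

    frequency: number of transactions observed for the customer
    avg_spend: the customer's observed average spend per transaction
    p, q, v:   model parameters (hypothetical values below), with q > 1
    """
    population_mean = v * p / (q - 1)
    weight = p * frequency / (p * frequency + q - 1)  # weight on the customer's own data
    return (1 - weight) * population_mean + weight * avg_spend


p, q, v = 6.0, 4.0, 15.0  # made-up parameters; population mean is 30
print(expected_average_spend(0, 0.0, p, q, v))   # no history: falls back to 30.0
print(expected_average_spend(2, 50.0, p, q, v))  # some history: pulled toward 50
```

    A customer with no transactions gets the population mean, while a customer with a long history gets an estimate close to their own average.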

    You can read the original paper on the Gamma Gamma model for details on how the above equation was derived. Now that we’ve discussed what the model is, let’s see it in action.

    Gamma Gamma Model In Action

    To demonstrate the Gamma Gamma model, I will return to the Online Retail dataset used in previous posts.

    I’ll start by once again reading in the transaction dataset:

    import pandas as pd

    transactions_df = pd.read_csv('data.csv', encoding='ISO-8859-1')

    The Gamma Gamma model that I will be using comes from the lifetimes package. The model requires two pieces of information in order to provide predictions:

    • The frequency of the customer transactions, and
    • The total amount spent for each transaction

    The required information will need to be calculated from the dataset. For the sake of brevity I will not be showing the calculation of the frequency and monetary values. You can refer to my RFM blog post for the details.

    At this point, if you’ve read my previous post, we will have two pandas dataframes containing the frequency and monetary values. We will now combine them into one dataframe to make management of the data a little bit easier. We also want to make sure that our dataframe does not contain any monetary values less than or equal to 0; the method that we will be using to calculate expected average profit will not work if such data is present.

    combined_df = pd.merge(frequency_df, monetary_df, on='CustomerID')
    combined_df = combined_df[combined_df['monetary'] > 0]
    

    Now before proceeding with the model fitting we want to check for a correlation between the frequency and monetary values. This is to ensure that the third assumption of the Gamma Gamma model is valid. We won’t be able to use the model if there’s a strong correlation between the two fields.

    combined_df[['frequency','monetary']].corr()
    frequency monetary correlation

    There appears to be a moderate correlation between frequency and monetary, but definitely not strong enough to stop us from using the model. We will go ahead and proceed with the fitting.

    import lifetimes

    gg_mdl = lifetimes.GammaGammaFitter()
    gg_mdl.fit(combined_df['frequency'], combined_df['monetary'])

    The expected average spend per transaction for each customer can be computed using the conditional_expected_average_profit method.

    combined_df['average_profit'] = gg_mdl.conditional_expected_average_profit(combined_df['frequency'],combined_df['monetary'])
    expected average profit

    Estimating Customer Lifetime Value

    Gamma Gamma can be combined with either the Pareto/NBD or BG/NBD to calculate a lifetime value figure for each customer. This is all possible with the customer_lifetime_value method. To use it, we’ll first need to build a transaction model. I will be demonstrating the use of the method using a Pareto/NBD model. Read my Pareto/NBD post for details on the construction of the model.

    Some of the other parameters used by the method that you’d want to know about are:

    • Frequency: The frequency of the customer transactions
    • Recency: The amount of time passed since the customer’s most recent transaction
    • Age: The amount of time since the customer’s initial purchase
    • Monetary: The monetary value of each transaction
    • Time: The number of months into the future to estimate the lifetime value for
    • Discount Rate: The monthly adjusted discount rate
    • Freq: The time unit that the age is measured in

    Here, we will be estimating the 3 month customer lifetime value for each customer.

    combined_df['cltv'] = gg_mdl.customer_lifetime_value(pareto_mbd_mdl,
                                                        combined_df['frequency'],
                                                        combined_df['recency'],
                                                        combined_df['age'],
                                                        combined_df['monetary'],
                                                        time=3,
                                                        freq='D')
    3 Month CLTV Estimation
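
    To give some intuition for what a discounted lifetime value figure means, here is a simplified sketch of the general idea: expected transactions in each future month are multiplied by the expected profit per transaction and discounted back to today. This is not the lifetimes implementation. The function discounted_clv and all input numbers are hypothetical.

```python
def discounted_clv(expected_tx_per_month, avg_profit, monthly_discount_rate=0.01):
    """Sum of discounted expected profit over a list of future months."""
    clv = 0.0
    for month, expected_tx in enumerate(expected_tx_per_month, start=1):
        # Profit expected in this month, discounted back to the present.
        clv += (avg_profit * expected_tx) / (1 + monthly_discount_rate) ** month
    return clv


# Hypothetical customer: expected transactions for each of the next 3 months.
print(round(discounted_clv([1.0, 0.8, 0.6], avg_profit=40.0), 2))
```

    Setting the discount rate to 0 reduces the calculation to expected transactions times average profit, which is a handy way to check the inputs.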

    That’s all Folks!

    You can find the code for this post here.

  • Introducing BG/NBD Models

    In my last post we examined the nuts and bolts of the Pareto/NBD model. If you haven’t had a chance to read it, you can find a link to the article here. The Beta Geometric/Negative Binomial Distribution model, or BG/NBD for short, is another probability model that is frequently used by practitioners to predict customer lifetime value in continuous, non-contractual settings. The underpinnings of the model are very similar to those of the Pareto/NBD. This post will cover the differences between the two models. Afterwards, I’ll give a quick demonstration of the model in action.

    BG/NBD Modeling Assumptions

    The modeling assumptions of BG/NBD are similar to those of the Pareto/NBD. The table below provides a side-by-side comparison of the assumptions behind both models:

    # | Pareto/NBD | BG/NBD
    --|------------|-------
    1 | A customer can be “alive” for an unobserved period of time, and then die. | A customer can be “alive” for an unobserved period of time, and then die.
    2 | While “alive”, the number of transactions made by a customer can be described using a Poisson distribution. | While “alive”, the number of transactions made by a customer can be described using a Poisson distribution.
    3 | Heterogeneity in the transaction rate across customers follows a gamma distribution. | After each transaction, a customer can drop out with probability p.
    4 | Each customer’s unobserved lifetime is distributed exponentially. | Each customer’s dropout probability is distributed across transactions according to a geometric distribution.
    5 | Heterogeneity in dropout rates across customers follows a gamma distribution. | Heterogeneity in dropout probabilities follows a beta distribution.
    6 | Both the transaction rates and the dropout rates vary independently across customers. | Both the transaction rates and the dropout probabilities vary independently across customers.
    Modeling assumptions

    As you can see, the first two assumptions of BG/NBD are the same as those of the Pareto/NBD. The third assumption and onwards are where the key differences lie. Let’s take a closer look at each of these and break down what they mean.

    Assumption #3: After each transaction, a customer can drop out with probability p.

    BG/NBD takes a different approach to modeling the lifetime of each customer than Pareto/NBD. The model assumes that after each transaction, a customer may choose to discontinue their patronage. If we were looking at a grocery store for example, this would be akin to a customer shopping at the grocery store one week, and then choosing not to return the following week, perhaps because the customer found another store that sells groceries at a cheaper price. The likelihood that a customer will drop out is represented with a probability p.

    Assumption #4: Each customer’s dropout probability is distributed across transactions according to a geometric distribution.

    Geometric distributions are used to model the number of independent trials until the first success. In our case, we are interested in modeling the number of transactions a customer will make before dropping out. We can figure out the likelihood that a customer will drop out after making some number of transactions using this formula:

    Geometric distribution: $P(X = x) = p (1 - p)^{x - 1}$

    The x in the equation is the number of transactions; p is the probability of dropping out after each transaction, as discussed in the previous assumption.

    If we were to plot the geometric distribution using some value for p and a range of values for x, we would see that the probability decreases as x increases. A customer can only drop out after their x-th transaction if they did not drop out after any of the earlier ones, so the customers who make many repeat purchases are the ones who have stuck around much longer than those who make only a few.

    Assumption #5: Heterogeneity in dropout probabilities follows a beta distribution.

    In other words, not every customer has an equal likelihood of dropping out after each transaction. Some customers may stick around much longer than others. Therefore we need a way to represent the variation in dropout probabilities across customers. That way is the beta distribution.

    Beta distribution: $f(p; a, b) = \frac{p^{a - 1} (1 - p)^{b - 1}}{B(a, b)}$

    Beta distributions are distributions on probabilities. They can be used to model conversion rates, how likely someone is to click on a Facebook ad, the survival rate of patients with a disease, and other similar probabilities. The parameters a and b represent the number of successes and failures we expect, and B(a, b) is the beta function that normalizes the distribution.
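
    Here is a quick way to see the beta distribution producing different dropout probabilities for different customers, using only the standard library. The values of a and b are made up; the mean of a beta distribution, a / (a + b), is a standard result.

```python
import random


def beta_mean(a, b):
    """Average probability implied by a beta distribution with parameters a and b."""
    return a / (a + b)


a, b = 2.0, 8.0  # hypothetical: dropping out is the less likely outcome
rng = random.Random(7)

# Draw an individual dropout probability for each of five customers.
dropout_probs = [rng.betavariate(a, b) for _ in range(5)]
print([round(p, 3) for p in dropout_probs])
print(beta_mean(a, b))  # 0.2
```

    Every customer gets their own dropout probability, but across a large customer base those probabilities average out to a / (a + b).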

    The Model

    The assumptions described above allow us to predict the probability that a customer will be alive after a certain time period and the number of purchases we can expect a customer to make in the future. This resource provides an excellent derivation of the equations used to calculate both.
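
    If it helps to see the customer story as a process, here is a rough simulation of one customer: purchases arrive at random with a constant rate while the customer is alive, and after every transaction there is a chance the customer drops out for good. This is only an illustration of the assumptions, not how lifetimes fits the model. The function name and all the numbers are hypothetical.

```python
import random


def simulate_bgnbd_customer(observation_weeks, purchase_rate, dropout_prob, rng):
    """Return the times (in weeks) of one customer's transactions."""
    times = []
    t = 0.0
    while True:
        # Poisson purchasing means exponentially distributed gaps between transactions.
        t += rng.expovariate(purchase_rate)
        if t > observation_weeks:
            break
        times.append(t)
        # After each transaction the customer drops out with probability dropout_prob.
        if rng.random() < dropout_prob:
            break
    return times


rng = random.Random(1)
dropout_prob = rng.betavariate(0.8, 2.5)  # this customer's own dropout probability
times = simulate_bgnbd_customer(52, purchase_rate=0.5, dropout_prob=dropout_prob, rng=rng)
print(len(times), [round(t, 1) for t in times])
```

    Running this for many customers, each with their own dropout probability, produces the mix of one-time buyers and loyal repeat buyers that the model is designed to describe.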

    BG/NBD In Action

    An implementation of BG/NBD is provided by the lifetimes Python package. The use and setup of the model are the same as for the Pareto/NBD. The only difference of course is the instantiation of the model:

    mdl = lifetimes.BetaGeoFitter()

    My previous post provides a run-through of the Pareto/NBD model on a sample dataset. Just replace the instantiation of the model with the code snippet above and you’re all set!

    That’s all folks!

    Next time I will show you another model that can be used to predict the monetary value of customer transactions, giving us a complete picture of the customer lifetime value. Talk to you then!

  • Buy Till You Die Modeling with Pareto/NBD

    This post is all about Pareto/Negative Binomial Distribution models, commonly referred to as Pareto/NBD. In my previous post I described four business cases where customer lifetime value can be calculated using probability models. Pareto/NBD can be used to model transactions in non-contractual, continuous business settings. These are cases where customers can not only make transactions at any time but also terminate their relationship with a business at any time.

    First, we’ll take a look at the nuts and bolts of the Pareto/NBD and the assumptions that drive the use of the model. Afterwards we’ll look at some code that shows the model in action.

    Pareto/NBD Modeling Assumptions

    There are six key assumptions underlying the use of the Pareto/NBD model. Let’s go through each and cover what they mean.

    Assumption #1: A customer can be “alive” for an unobserved period of time, and then die.

    By “alive” we mean that the customer is actively purchasing products and services from the business. We consider the customer as “dead” when there is no purchasing activity from the customer for some period of time.

    Assumption #2: While “alive”, the number of transactions made by a customer can be described using a Poisson distribution.

    A Poisson distribution is a probability distribution that is used to describe the chance of a given number of events occurring during a fixed time period. Let’s suppose for example we have 18 months of transaction data for a grocery store customer. We know that on average the customer makes 2.5 transactions per week and we’d like to find the probability that the customer will make 5 transactions next week. We can find this out using the formula below:

    $P(k) = \frac{\mu^{k} e^{-\mu}}{k!}$

    The µ symbol represents the event rate per time period, which in our case is 2.5. The variable k represents the number of events. For our pet problem this will be 5. Punching in the numbers, we get approximately 0.0668.
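
    If you want to check that number yourself, here is a minimal sketch in plain Python. The function name poisson_pmf and its optional t argument (the number of time periods, defaulting to 1) are my own choices for this example.

```python
from math import exp, factorial


def poisson_pmf(k, mu, t=1):
    """Probability of exactly k events over t periods when the rate is mu per period."""
    rate = mu * t
    return rate**k * exp(-rate) / factorial(k)


# 5 transactions next week for a customer averaging 2.5 per week.
print(round(poisson_pmf(5, 2.5), 4))  # 0.0668
```

    Passing t=3 and k=12 evaluates the multi-week case of 12 transactions in 3 weeks with the same function.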

    If we create a plot of the Poisson distribution for our little scenario we’ll get this:

    Poisson distribution plot

    From the plot, we can see that the distribution peaks around 2.5. We can also see that the probability becomes smaller the further the number of transactions is from 2.5, which is consistent with the calculation we made earlier.

    As an aside before moving on, if we wanted to calculate the probability of events occurring over multiple time periods (as in 12 transactions in 3 weeks), we can use this modified version of the Poisson formula:

    $P(k) = \frac{(\mu t)^{k} e^{-\mu t}}{k!}$

    Where t represents the number of periods we’re interested in.

    Assumption #3: Heterogeneity in the transaction rate across customers follows a gamma distribution.

    Customers have different shopping habits. If we’re looking at a massage business for instance, some customers may schedule massage sessions every week, while other customers request massages once in a blue moon. With this in mind, we would need a way to account for the variation in transaction rates across customers. That way is the gamma distribution.

    Whereas a Poisson distribution can be used to determine the chance of a number of transactions happening in a time period, a gamma distribution can be used to describe the amount of time it takes for a desired number of transactions to occur.

    Using the grocery store example from the previous section, if we wanted to determine the probability that it takes 3 weeks for the customer to make 5 transactions, this is the formula we would use:

    $f(t) = \frac{\beta^{\alpha} t^{\alpha - 1} e^{-\beta t}}{\Gamma(\alpha)}$

    The variables α and β represent the number of events and the event rate. The variable t represents the wait time. The expression in the denominator of the formula is known as the gamma function. You can find out more information about it here.
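
    Here is a small sketch that evaluates the gamma density for the grocery store example, with α as the number of events, β as the event rate and t as the wait time. The function name gamma_pdf is my own. Note that for a continuous wait time this value is a density rather than a probability; to get an actual probability you would add it up over a range of wait times.

```python
from math import exp, gamma


def gamma_pdf(t, alpha, beta):
    """Gamma density at wait time t for alpha events arriving at rate beta."""
    return beta**alpha * t ** (alpha - 1) * exp(-beta * t) / gamma(alpha)


# Density of waiting 3 weeks for 5 transactions at 2.5 transactions per week.
print(gamma_pdf(3, alpha=5, beta=2.5))
```

    With alpha set to 1 the formula collapses to the exponential density, which is a useful check on the implementation.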

    Assumption #4: Each customer’s unobserved lifetime is distributed exponentially.

    There’s always a degree of uncertainty as to whether or not a customer has taken their business elsewhere.

    Let’s say that you have two customers. Customer A has gotten massages consistently, roughly once a month, over the past 15 months. We could safely assume that this customer will be returning in the next month. Customer B, however, used to get massages once a week but hasn’t been present for a few months. In that case we may assume that the customer won’t be around in the next month. We can use an exponential distribution to model this behavior.

    The exponential distribution represents the time between events in a Poisson process. In our case we can use it to model how long it takes before a customer terminates their relationship with our business.
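
    As a quick illustration, the chance that an exponentially distributed lifetime lasts longer than some time t has a simple closed form. The helper name and the dropout rate below are hypothetical.

```python
from math import exp


def probability_still_alive(t, dropout_rate):
    """P(lifetime > t) for an exponentially distributed customer lifetime."""
    return exp(-dropout_rate * t)


dropout_rate = 0.05  # hypothetical dropout rate per month
for months in (1, 6, 12, 24):
    print(months, round(probability_still_alive(months, dropout_rate), 3))
```

    The probability starts at 1 and decays steadily, so the longer the horizon, the less likely the customer is still around.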

    Assumption #5: Heterogeneity in dropout rates across customers follows a gamma distribution.

    Not every customer has an equal likelihood of leaving your business. Much like the transaction rates discussed earlier, we need to model the variation in dropout rates across customers using a probability distribution. Once again, the gamma distribution is what is used for this.

    Assumption #6: Both the transaction rates and the dropout rates vary independently across customers.

    This means that a customer’s transaction rate tells us nothing about that same customer’s dropout rate, and vice versa: a frequent shopper is no more or less likely to leave than an occasional one. Customers are also treated as independent of each other, so the transactions of one customer do not influence the transactions of another.

    The Model

    So… let’s now talk about how these assumptions create the Pareto/NBD model. If we combine the second and third assumptions from above (Poisson transactions with gamma-distributed transaction rates), we’ll get the negative binomial distribution (NBD) model. This model is used to predict the number of transactions k a customer will make while alive, given a time period t.

    When we combine assumptions four and five (exponential lifetimes with gamma-distributed dropout rates), we get the Pareto distribution, which is used to model the lifetime of the customer.
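
    To make the NBD piece a little more concrete, here is a sketch of the distribution you get when Poisson transaction counts are averaged over gamma-distributed transaction rates. The symbols r (shape) and alpha (rate) for the gamma distribution, and the example values, are assumptions for illustration.

```python
from math import exp, lgamma, log


def nbd_pmf(k, t, r, alpha):
    """Probability of k transactions in a period of length t when transaction
    rates across customers are gamma-distributed with shape r and rate alpha."""
    log_pmf = (
        lgamma(r + k) - lgamma(r) - lgamma(k + 1)
        + r * log(alpha / (alpha + t))
        + k * log(t / (alpha + t))
    )
    return exp(log_pmf)


# Hypothetical customer base: r = 2, alpha = 4, looking at a 4-week period.
print(round(nbd_pmf(0, t=4, r=2, alpha=4), 4))  # 0.25
```

    The average number of transactions under this distribution works out to r * t / alpha, which is 2 for the values above.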

    If you’d like to see the derivation of the entire model, you can find an excellent resource that breaks down the process here.

    The parameters of the Pareto/NBD model (r and α for the transaction process, s and β for the dropout process) are estimated using the recency and frequency of the customer transactions.

    Now that we’ve discussed the components of the Pareto/NBD model, let’s see it in action.

    Pareto/NBD in action

    To demonstrate Pareto/NBD we will use the online retail dataset.

    We’ll begin by loading the dataset:

    import pandas as pd

    transactions_df = pd.read_csv('data.csv', encoding='ISO-8859-1')

    The Pareto/NBD model that we will be using comes from the lifetimes Python module. In order to fit the model to our dataset, three inputs are required for each customer:

    • The frequency of the customer’s transactions
    • The amount of time passed since the customer’s last purchase, a.k.a. recency
    • The amount of time since the customer’s first purchase, a.k.a. age

    Lifetimes also provides a utility function that can be used to calculate the above information for most datasets. Our dataset, unfortunately, is not formatted in a way that allows the function to properly calculate the frequency. We will therefore need to calculate the information manually.

    For the sake of brevity I will only be showing the calculation of the customer age. Read my RFM blog post to find the calculation for recency and frequency.

    first_transactions = transactions_df.groupby('CustomerID')['InvoiceDate'].min().reset_index()
    first_transactions['age'] = first_transactions['InvoiceDate'].apply(lambda date: (most_recent_transaction - date).days)

    We’ll merge the recency, frequency, and age into one pandas dataframe before fitting the Pareto/NBD model:

    recency_frequency_df = pd.merge(pd.merge(recency_df, frequency_df, on='CustomerID').drop('InvoiceDate',axis=1), 
                                    first_transactions, on='CustomerID').drop('InvoiceDate', axis=1)

    Now for the model fitting:

    import lifetimes

    mdl = lifetimes.ParetoNBDFitter()
    mdl.fit(recency_frequency_df['frequency'], recency_frequency_df['recency'], recency_frequency_df['age'])

    The Pareto/NBD model can be used to generate probabilities that a customer is still alive. Let’s generate probabilities for each customer in the dataset using the conditional_probability_alive method:

    recency_frequency_df['probability_alive'] = mdl.conditional_probability_alive(recency_frequency_df['frequency'],
                                                                                  recency_frequency_df['recency'],
                                                                                  recency_frequency_df['age'])
    Conditional Probabilities of Being Alive

    We can also visually inspect a heatmap of the probabilities that our customers are alive using the plot_probability_alive_matrix function.

    from lifetimes.plotting import plot_probability_alive_matrix
    plot_probability_alive_matrix(mdl)
    Probability Alive Plot

    The heatmap shows the probability that our customers are still alive based on historical frequency and recency values. Recency in the plot is defined as the time between the first and last transaction. From the plot we can see that:

    • Customers who have made multiple purchases over a long period of time have a high chance of being alive.
    • New customers with a few transactions are also likely to still be alive.

    The model can also estimate the number of transactions customers will make in the future. We will use the conditional_expected_number_of_purchases_up_to_time method to predict purchases for each customer 20 days into the future:

    recency_frequency_df['predicted_transactions'] = mdl.conditional_expected_number_of_purchases_up_to_time(20,recency_frequency_df['frequency'],
                                                                                                             recency_frequency_df['recency'],
                                                                                                             recency_frequency_df['age'])
    Expected transactions

    Both the predicted probabilities and expected number of transactions give us more insight on the high value customers in the dataset.

    That’s all folks!

    In my next post I will cover the BG/NBD model, a probability model that is very similar to the Pareto/NBD model but makes slightly different assumptions about customer behavior. You can find the entire code discussed in the post here.

  • Probability Models for LTV Calculation

    If you’re looking for quick and dirty methods for calculating the lifetime value of your customer base, the first and second posts in my series are just what you’re looking for. In this post we will dive into probability models and how they can provide a more sophisticated method for estimating LTV than the simple averaging methods discussed thus far.

    Enter Probability Models

    If we examine the purchasing activity of any group of consumers, we’ll see that there’s a natural variation in behavior between individuals. We can use probability models to take this variation into account when calculating lifetime value for any customer.

    The probability model you’d want to use will depend on the business context it will be applied in. Ask yourself two questions:

    1. Do my customers have a contractual relationship with my business?
    2. Do customer transactions happen at specific times or can they happen at any time?

    Let’s go over these two questions real quick and explain what they mean.

    Contractual vs. Non-Contractual

    The first question you want to ask yourself when determining the optimal model to use for your problem is whether customers interact with your business on a contractual basis. Companies that operate on a subscription business model fall into this category. In contractual settings, the moment at which a customer ends their patronage can be observed. This is typically indicated by a customer deciding to cancel their subscription.

    Customer lifetime value for contractual engagements can be estimated using survivor-based models.

    As for non-contractual businesses, where the end of a customer’s patronage is not known, the customer lifetime value is often modeled using exponential models.

    Discrete vs. Continuous

    After determining if you’re dealing with a contractual or non-contractual problem, you next want to think about how customers will be making transactions with your business.

    If your customers make transactions at fixed periods of time (e.g. weekly or monthly) then we call these transactions discrete. SaaS companies that bill their customers for services every month would fall under this category.

    If on the other hand a customer can make a transaction at your business at any time, the transactions are considered to be continuous. Purchases made at an electronics store would fall under this category.

    The Business Contexts At a Glance

    Combining the answers to the two questions above, you’ll notice that there are four types of business contexts that can come up. Below are the four contexts and examples of businesses that fall under each.

    Non-contractual settings with continuous purchases

    • Movie rentals
    • Medical appointments
    • Hotel stays
    • Grocery purchases

    Contractual settings with continuous purchases

    • Costco membership
    • Credit cards

    Non-contractual settings with discrete purchases

    • Prescription refills
    • Event attendance

    Contractual settings with discrete purchases

    • Magazine subscriptions
    • Gym membership
    • Netflix, Hulu, and other streaming websites
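    The two questions can be turned into a tiny lookup. The sketch below is purely illustrative: BusinessContext and classify_context are hypothetical names made up for this example, not part of any library.

```python
from typing import NamedTuple


class BusinessContext(NamedTuple):
    relationship: str  # "contractual" or "non-contractual"
    timing: str        # "discrete" or "continuous"


def classify_context(has_contract: bool, fixed_schedule: bool) -> BusinessContext:
    """Map the answers to the two questions onto one of the four contexts."""
    relationship = "contractual" if has_contract else "non-contractual"
    timing = "discrete" if fixed_schedule else "continuous"
    return BusinessContext(relationship, timing)


# Examples taken from the lists above
magazine = classify_context(has_contract=True, fixed_schedule=True)
groceries = classify_context(has_contract=False, fixed_schedule=False)
print(magazine)
print(groceries)
```

    Answering the two questions first keeps the choice of model honest: each of the four combinations calls for a different family of models.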

    That’s all folks!

    In my next post I will discuss the Pareto/NBD model, a model that can be used to describe continuous transactions in non contractual business settings. Until next time.

  • Calculating Lifetime Value with Cohorts

    Calculating Lifetime Value with Cohorts

    Customer lifetime value is a metric that can be used to gauge the health of a business. In my last blog post on the topic, I showed Python code that estimates customer lifetime value by computing a simple average. Today I will show you another approach that improves on the one described previously.

    Introducing Cohort Aggregates

    This method for estimating customer lifetime value computes multiple metrics, one for each cohort of customers in the dataset. Not only does this provide a much better estimate than a simple average, but it also allows us to study the behavior of customers in response to different marketing campaigns. We will carry out the calculation using the same CDNow dataset used in the last post.

    Let’s begin the same way as before by loading the dataset:

    from datetime import datetime

    import pandas as pd

    with open('CDNOW_master.txt') as f:
        dataset = f.read().split("\n")
    
    records = []
    for line in dataset:
        if line == '':
            continue
        row = list(filter(lambda token: token != '', line.split(' ')))
        rec = {}
        rec['customerID'] = row[0]
        rec['purchaseDate'] = datetime.strptime(row[1], '%Y%m%d')
        rec['quantity'] = int(row[2])
        rec['price'] = float(row[3])
        records.append(rec)
    transactions_df = pd.DataFrame(records)

    Next we’ll create a new column in our dataframe that contains total dollar value for each transaction. This column will be used later when we calculate customer lifetime value for each cohort.

    transactions_df['total'] = transactions_df['price'] * transactions_df['quantity']

    Speaking of cohorts, allow me to explain what I mean by this. For the sake of this analysis I’ll define a cohort as a group of customers who made their first transaction in the same month. In order to perform this grouping, we first need to figure out in which month each person became a customer. The following code snippet will create a dataframe containing the information we need.

    first_transactions_df = transactions_df.groupby('customerID')['purchaseDate'].min().reset_index()
    
    first_transactions_df['firstTransactionMonth'] = first_transactions_df['purchaseDate'].dt.month
    
    first_transactions_df['firstTransactionMonth'].value_counts()
    Number of newly acquired customers by month
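
    One caveat with .dt.month: it returns only the month number (1 to 12), so first purchases made in the same calendar month of different years would land in the same cohort. That is harmless for CDNow, whose customers all made their first purchase in the first quarter of 1997, but for a dataset spanning several years a year-month period is safer. A minimal sketch on made-up data (the toy customers and dates are assumptions for illustration):

```python
import pandas as pd

# Toy transactions; IDs and dates are made up for illustration
toy_df = pd.DataFrame({
    'customerID': ['a', 'a', 'b', 'c'],
    'purchaseDate': pd.to_datetime(
        ['1997-01-05', '1997-03-10', '1998-01-20', '1997-02-14']
    ),
})

first_purchase = toy_df.groupby('customerID')['purchaseDate'].min().reset_index()

# Month number alone: customers 'a' and 'b' collide although they are a year apart
first_purchase['month_only'] = first_purchase['purchaseDate'].dt.month

# A year-month period keeps them in separate cohorts
first_purchase['cohort'] = first_purchase['purchaseDate'].dt.to_period('M')

print(first_purchase)
```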

    Amazing! Now we can merge this dataframe into our transactions dataset so everything is in one place.

    
    transactions_df = pd.merge(transactions_df, first_transactions_df[['customerID','firstTransactionMonth']], on='customerID')
    Customer Transactions Dataframe

    We’re now ready to calculate customer lifetime value for each cohort. To do this we will define a function that computes the customer lifetime value for a subset of customers.

    def calculate_cltv(df):
        transactions_per_customer = df.groupby('customerID')['purchaseDate'].count()
        avg_frequency = transactions_per_customer.mean()
        
        minmax_purchase_dates_by_customer = df.groupby('customerID')['purchaseDate'].agg(['min','max'])
        
        customer_lifetimes = minmax_purchase_dates_by_customer.apply(lambda row: (row['max'] - row['min']).days, axis=1)
        avg_lifetime = customer_lifetimes.mean()
        
        # Use the cohort's own transactions rather than the full dataset
        avg_order_value = df['total'].mean()
        
        return avg_frequency * avg_lifetime * avg_order_value

    All that’s left to do now is to aggregate our transactions by the firstTransactionMonth and apply the function we defined earlier to each group.

    transactions_df.groupby('firstTransactionMonth').apply(calculate_cltv)
    Customer Lifetime Value by Cohort

    Lo and behold we are done!

    That’s all folks!

    You can find the complete code for this post here. Until next time!

  • A Primer on Customer Lifetime Value

    A Primer on Customer Lifetime Value

    A challenge faced by many companies is properly budgeting for marketing campaigns. Recently I’ve been digging into quite a bit of literature on customer lifetime value in preparation for a project I will be taking on with a client. In this post I will provide a quick overview of methods for modeling customer lifetime value for businesses. I will also show a simple calculation of customer lifetime value in Python. Many of the methods highlighted in this post will be explored in more detail in future posts.

    What is customer lifetime value?

    Customer lifetime value, or CLTV, is a metric that measures the total value a customer will bring to a company over his/her lifetime. Customer lifetime value is an important metric that helps companies gauge the health of their business.

    For starters, the cost of acquiring new customers is typically high for many organizations. Retaining these customers is therefore of great interest to a company. CLTV allows companies to allocate resources effectively in order to prevent churn.

    Much like the RFM technique discussed in my last post, customer lifetime value helps a company identify customers that are more valuable to the business and optimize its marketing spend. By analyzing historical purchases made by customers, CLTV allows companies to predict what kind of purchases they can expect their customers to make in the future.

    How is customer lifetime value calculated?

    As I mentioned at the beginning of this post, there are several methods for calculating and predicting customer lifetime value. I’ll now go through each method in order of simplest to most complex.

    Method 1: Simple Average

    This method involves calculating a single value that represents the average customer lifetime value for the entire customer base. The calculation is pretty straightforward. You’ll need to know three things:

    • The average monetary value for each transaction (AOV)
    • The average customer lifespan (ACL)
    • The average purchase frequency rate (APF)

    The formula for customer lifetime value is as follows:

    CLTV = AOV × ACL × APF

    Calculating CLTV in Python

    As a demonstration I will walk you through the computation of customer lifetime value using the CDNow dataset.

    The CDNow dataset contains the purchase history up until the end of June 1998 of a cohort of individuals who made their first purchase at CDNow in the first quarter of 1997. The text file containing the data provides the unique identifier for the customer, the date of the transaction, the number of CDs purchased, and the dollar value of the transaction.

    We’ll start by reading the text file and creating a dataframe:

    from datetime import datetime

    import pandas as pd

    with open('CDNOW_master.txt') as f:
        dataset = f.read().split("\n")
    
    records = []
    for line in dataset:
        if line == '':
            continue
        row = list(filter(lambda token: token != '', line.split(' ')))
        rec = {}
        rec['customerID'] = row[0]
        rec['purchaseDate'] = datetime.strptime(row[1], '%Y%m%d')
        rec['quantity'] = int(row[2])
        rec['price'] = float(row[3])
        records.append(rec)
    transactions_df = pd.DataFrame(records)
    CDNow dataframe

    Next we will calculate the average purchase frequency:

    transactions_per_customer = transactions_df.groupby('customerID')['purchaseDate'].count()
    avg_frequency = transactions_per_customer.mean()

    Now for the average customer lifespan:

    minmax_purchase_dates_by_customer = transactions_df.groupby('customerID')['purchaseDate'].agg(['min','max'])
    customer_lifetimes = minmax_purchase_dates_by_customer.apply(lambda row: (row['max'] - row['min']).days, axis=1)
    avg_lifetime = customer_lifetimes.mean()

    Last but not least the average dollar value per transaction:

    transactions_df['total'] = transactions_df['price'] * transactions_df['quantity']
    avg_order_value = transactions_df['total'].mean()

    Finally we will calculate the customer lifetime value using all the pieces:

    customer lifetime value calculation
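
    A minimal sketch of this last step, assuming the three pieces are simply multiplied together. The numbers below are hypothetical stand-ins so the snippet runs on its own; in the walkthrough they would be the avg_frequency, avg_lifetime, and avg_order_value computed above.

```python
# Hypothetical stand-in values, not results from the CDNow dataset
avg_frequency = 3.0      # average number of transactions per customer
avg_lifetime = 120.0     # average days between first and last purchase
avg_order_value = 40.0   # average dollar value per transaction

cltv = avg_frequency * avg_lifetime * avg_order_value
print("Estimated CLTV: ${:,.2f}".format(cltv))
```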

    As you can see, the final number amounts to $64,906. This method, while simple, can severely overestimate CLTV for many customers. Using pandas, we can examine the distribution of the average order value for each customer.

    avg_order_value_per_customer = transactions_df.groupby('customerID')['total'].mean()
    avg_order_value_per_customer.describe()
    Average order value summary statistics

    As you can see we have some big spending customers who are making the CLTV much higher than it should be. This is even more apparent when we look at a boxplot of the same data.

    Boxplot
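
    To see how a handful of big spenders can drag an average upward, here is a small sketch on made-up order values (the numbers are assumptions, not taken from CDNow). It compares the mean with the median, which is far less sensitive to outliers.

```python
import statistics

# Made-up average order values: most customers spend modestly, one spends a lot
order_values = [20.0, 25.0, 30.0, 35.0, 40.0, 1000.0]

mean_value = statistics.mean(order_values)      # pulled upward by the outlier
median_value = statistics.median(order_values)  # barely affected by it

print("mean:", round(mean_value, 2))
print("median:", median_value)
```

    When the two numbers are far apart, as they are here, a single average is a poor summary of the typical customer.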

    Method 2: Cohort Average

    This method involves dividing customers into cohorts and computing customer lifetime value for each. A cohort is defined as a group of customers who made their first purchase during the same time period. Not only does this method provide a better estimate of CLTV than a simple average, but it also allows us to study the behavior patterns of customers as a result of different marketing campaigns. I’ll cover this method in more detail in a future post.

    Method 3: Predictive Probabilistic Models

    The more sophisticated modeling techniques cited in literature use probability distributions to estimate the frequency, lifetime, and monetary components of the customer lifetime value equation. There are a number of methods that can be used, each ideal for specific business contexts. The Beta Geometric/Negative Binomial Distribution, Pareto/Negative Binomial Distribution, and Gamma-Gamma models are a few of the most popular probabilistic models used by companies today. Look out for future posts on this blog about all of these models.

    Method 4: Machine Learning

    All the approaches I highlighted utilize recency, purchase frequency, and order values for each customer to predict customer lifetime value. What if you’d like to incorporate other variables beyond these in your prediction? This is where machine learning comes in handy. Machine learning algorithms can find patterns in the data to accurately predict future customer behaviors.

    That’s all folks!

    The code shown in this post can be found here. Until next time!

  • Simple yet effective RFM

    Simple yet effective RFM

    Looking for a quick and dirty way to segment your customer data that does not require machine learning? In this post we will take a look at a python implementation of a popular technique used by marketers to divide customers into buckets of high, medium, and low value customers. The technique is called RFM.

    What is RFM?

    RFM quantitatively ranks and groups customers based on three factors:

    How recent was the customer’s last purchase. The more recently a customer made a purchase, the more likely that customer will make a purchase again as he/she would likely still have the product on their mind.

    How frequently did the customer purchase products. Frequent patrons are likely to purchase again in the future and are therefore valuable to the business.

    The amount of money the customer spent. Customers who spent a lot of money during a particular period of time are very valuable to the business, as they may make another purchase in the future.

    As you may have guessed RFM is short for recency, frequency, and monetary. Here’s how it works:

    1. First, rate each customer on each of the three factors. Generally a scale of 1 to 5 is used, with 5 being the highest possible score. However, you are free to choose a different scaling that makes better sense for your problem.
    2. Next, average the three scores to compute an RFM score. The higher the score is for a customer, the more valuable the customer is. Alternatively, you can compute a weighted average of the three scores if certain factors in your problem are more important than others.
    3. Lastly, group each customer into high, medium, and low value buckets based on the RFM score.
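
    The three steps can be sketched in a few lines of plain Python. Everything here is hypothetical: the customers, their ratings, and the cut-offs used for the buckets are made up for illustration.

```python
# Step 1 is assumed done: hypothetical 1-to-5 ratings for three customers
ratings = {
    'alice': {'recency': 5, 'frequency': 4, 'monetary': 5},
    'bob':   {'recency': 3, 'frequency': 2, 'monetary': 2},
    'carol': {'recency': 1, 'frequency': 1, 'monetary': 2},
}


def rfm_score(r):
    # Step 2: plain average of the three ratings
    return (r['recency'] + r['frequency'] + r['monetary']) / 3


def bucket(score, high=3.5, low=2.0):
    # Step 3: the cut-offs are arbitrary choices for this sketch
    if score >= high:
        return 'High'
    if score >= low:
        return 'Medium'
    return 'Low'


segments = {name: bucket(rfm_score(r)) for name, r in ratings.items()}
print(segments)
```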

    RFM in action

    Now that you’re acquainted with RFM, we’ll apply it to some customer data so you can see it in action. The dataset we’ll use is from a UK-based online retail store.

    We’ll start by reading the provided CSV file and inspecting the data attributes.

    import pandas as pd

    transactions_df = pd.read_csv('data.csv',encoding='ISO-8859-1')
    transactions_df.info()

    Here’s what you’ll see as output:

    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 541909 entries, 0 to 541908
    Data columns (total 8 columns):
     #   Column       Non-Null Count   Dtype  
    ---  ------       --------------   -----  
     0   InvoiceNo    541909 non-null  object 
     1   StockCode    541909 non-null  object 
     2   Description  540455 non-null  object 
     3   Quantity     541909 non-null  int64  
     4   InvoiceDate  541909 non-null  object 
     5   UnitPrice    541909 non-null  float64
     6   CustomerID   406829 non-null  float64
     7   Country      541909 non-null  object 
    dtypes: float64(2), int64(1), object(5)
    memory usage: 33.1+ MB

    For this exercise, we will be focusing on the InvoiceNo, Quantity, InvoiceDate, UnitPrice, and CustomerID attributes.

    The CustomerID attribute uniquely identifies an online store customer.

    The InvoiceNo and InvoiceDate attributes identify a transaction performed by a customer and the date it was performed.

    A transaction in this dataset is composed of one or more items purchased at various quantities. The Quantity and UnitPrice attributes contain the amount of an item purchased and the cost per item.

    Looking at the output above, it appears that the CustomerID attribute contains missing values. We will proceed by discarding the rows with the missing values from the dataset.

    transactions_df = transactions_df[transactions_df['CustomerID'].notnull()].copy()

    Next, we will convert the InvoiceDate attribute into a datetime attribute to make it easier to work with.

    transactions_df['InvoiceDate'] = pd.to_datetime(transactions_df['InvoiceDate'])

    Looking at the transaction data, it looks like there are some transactions with negative quantities. These are likely refunds made to the customer. We will discard these transactions from our analysis as well.

    # .copy() lets us add columns later without a SettingWithCopyWarning
    transactions_df = transactions_df[transactions_df['Quantity'] > 0].copy()

    Next up, we will create three new pandas dataframes containing computed recency, frequency, and monetary values for each customer.

    We will make use of the newly converted InvoiceDate attribute to obtain the most recent transaction for each customer and compute the number of days between it and the latest transaction in the entire dataset.

    # Get the most recent transaction in the entire dataset
    most_recent_transaction = transactions_df['InvoiceDate'].max()
    
    # Compute the latest transaction for each user
    latest_transactions_per_user = transactions_df.groupby('CustomerID')['InvoiceDate'].max()
    
    recency_df = latest_transactions_per_user.reset_index()
    recency_df['recency'] = recency_df['InvoiceDate'].apply(lambda date: (most_recent_transaction - date).days)
    
    # Inspect the results
    recency_df.head()
    recency values per customer

    Next we will create a dataframe containing the number of transactions made by each customer. The code snippet below accomplishes this by counting the unique invoices produced for each customer.

    # Calculate the number of individual invoices generated for each customer
    num_transactions_per_user = transactions_df.groupby('CustomerID')['InvoiceNo'].nunique()
    
    frequency_df = num_transactions_per_user.reset_index().rename(columns={'InvoiceNo':'frequency'})
    
    # Inspect the results
    frequency_df
    frequency values per customer

    Finally we will calculate the total spend for each customer.

    # Create a new column containing the total amount spent for each item
    transactions_df['Total'] = transactions_df['Quantity'] * transactions_df['UnitPrice']
    
    # Now, aggregate total spend by customer
    monetary_df = transactions_df.groupby('CustomerID')['Total'].sum().reset_index().rename(columns={'Total':'monetary'})
    
    monetary_df
    monetary value per customer

    Let’s merge the recency, frequency, and monetary dataframes into one dataframe.

    rfm_df = pd.merge(pd.merge(recency_df, frequency_df, on='CustomerID'), monetary_df, on='CustomerID').drop('InvoiceDate',axis=1)
    combined recency, frequency, and monetary values

    We’ll now rank the recency, frequency, and monetary values on a scale of 1 to 5. We will use the pandas cut function to bin each value into 5 equal-width intervals. You can also pass your own bin edges to cut if you’d prefer different intervals.

    rfm_df['recency_score'] = pd.cut(rfm_df['recency'], bins=5, labels=[5,4,3,2,1]).astype(int)
    rfm_df['frequency_score'] = pd.cut(rfm_df['frequency'], bins=5, labels=[1,2,3,4,5]).astype(int)
    rfm_df['monetary_score'] = pd.cut(rfm_df['monetary'], bins=5, labels=[1,2,3,4,5]).astype(int)
    rankings per customer
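
    One thing to keep in mind is that equal-width bins are sensitive to outliers: a single very large value stretches the range, and most customers can end up in the lowest bin. A possible alternative is pandas’ qcut, which bins by quantile instead. The sketch below contrasts the two on made-up spend values (the data is an assumption for illustration, not the retail dataset):

```python
import pandas as pd

# Made-up, heavily skewed spend values: one customer dwarfs the rest
spend = pd.Series([10, 20, 30, 40, 50, 60, 70, 80, 90, 5000])

# Equal-width bins: the range 10..5000 is split into 5 equal slices,
# so every value except the largest falls into the first bin
equal_width = pd.cut(spend, bins=5, labels=[1, 2, 3, 4, 5]).astype(int)

# Quantile bins: each bin holds roughly the same number of customers
quantile = pd.qcut(spend, q=5, labels=[1, 2, 3, 4, 5]).astype(int)

print(equal_width.value_counts().sort_index())
print(quantile.value_counts().sort_index())
```

    Which one is appropriate depends on whether you want scores to reflect absolute differences (cut) or relative rank within the customer base (qcut).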

    Using the recency, frequency, and monetary ratings we will create an RFM score. We will calculate scores for each user by computing a weighted sum of the three rankings. Since the weights below add up to 1.10, the resulting scores fall between 1.1 and 5.5.

    rfm_df['rfm_score'] = (0.35*rfm_df['recency_score'] + 0.35*rfm_df['frequency_score'] + 0.40*rfm_df['monetary_score'])
    rfm_df['rfm_score'] = rfm_df['rfm_score'].round(2)
    RFM scores per user
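
    If you would rather keep the RFM score on the same 1-to-5 scale as the individual ratings, the weights need to sum to 1. A minimal sketch using NumPy, with hypothetical scores and illustrative weights (neither comes from the retail dataset):

```python
import numpy as np

# Hypothetical 1-to-5 scores: columns are recency, frequency, monetary
scores = np.array([
    [5, 5, 5],
    [1, 1, 1],
    [4, 2, 3],
])

# Illustrative weights that sum to exactly 1
weights = np.array([0.3, 0.3, 0.4])

# np.average divides by the sum of the weights, so results stay within 1..5
rfm_scores = np.average(scores, axis=1, weights=weights)
print(rfm_scores.round(2))
```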

    Now that the RFM scores are computed for each customer, let’s form three segments of high, medium, and low value customers.

    • We’ll define high tier customers as customers with RFM scores of 3 or greater
    • Medium tier customers will be defined as customers with RFM scores of at least 2 but less than 3
    • Customers with RFM scores less than 2 will be placed in the low tier bucket.

    def segment_customer(score):
        if score >= 3:
            return 'High'
        elif score >= 2 and score < 3:
            return 'Medium'
        else:
            return 'Low'
    
    rfm_df['segment'] = rfm_df['rfm_score'].apply(segment_customer)
    Customer segments

    We got our segments! We can aggregate our customers by segment to see what the breakdown looks like.

    rfm_df.groupby('segment')['CustomerID'].count()

    The output…

    segment
    High        16
    Low       1041
    Medium    3282
    Name: CustomerID, dtype: int64

    That’s all folks!

    RFM is a simple and quick way to segment your customers into buckets that you can quickly take action on. If you’re looking for segmentation techniques that utilize additional customer attributes, check out some of the other methods I discuss on the blog such as K-means. You can find the entire code for this post here. Until next time!

  • Improving the Video Game Recommender

    Improving the Video Game Recommender

    In my last post, we took the 380+ video game attributes that we extracted from the Steam video game dataset and wrote an algorithm to cluster the attributes into 24 groups. In this post we will use the clusters to make a new and improved video game recommender. If you haven’t read my first and second posts on the video game recommender, please read them before continuing.

    Step 1: Reconstructing the game features table

    The game features table that we built in the first post contained hundreds of attributes. We will construct a much smaller table using the game attribute clusters.

    Game Features Table

    Each column in the table indicates how strongly that attribute cluster is represented in each game. The numbers are computed for each game using the attribute-cluster assignments we obtained from K-means. For each game, we count up the number of attributes that belong to each cluster. The construction of the new game features table is described in the code snippet below.

    import pandas as pd
    from sklearn.cluster import KMeans

    # Fit the K-means clustering algorithm and get the cluster assignment for each attribute
    km = KMeans(n_clusters=24, random_state=25)
    km.fit(feature_set)
    labels = km.predict(feature_set)
    attribute_assignments = pd.Series(labels, index=game_categories)

    # Group attributes into a list of clusters
    attribute_clusters = []
    for i in range(24):
        cluster = attribute_assignments[attribute_assignments == i]
        attribute_clusters.append(cluster.index.tolist())

    # 24 cluster columns plus one extra column (clust_24)
    feature_columns = ['clust_'+str(i) for i in range(25)]

    game_features = []
    for idx in range(steam_games_df.shape[0]):
        # Obtain list of genres, tags, and specs
        game_genre = steam_games_df.iloc[idx]['genres']
        game_tags = steam_games_df.iloc[idx]['tags']
        game_specs = steam_games_df.iloc[idx]['specs']
        attributes = []
        data_row = {k:0 for k in feature_columns}
        data_row['id'] = steam_games_df.iloc[idx]['id']

        # Iterate through each entry in the lists and create the features
        if game_genre:
            attributes.extend(game_genre.split(','))
        if game_tags:
            attributes.extend(game_tags.split(','))
        if game_specs:
            attributes.extend(game_specs.split(','))
        attributes = set(attributes)

        if len(attributes) > 0:
            for attr in attributes:
                for i in range(len(attribute_clusters)):
                    if attr in attribute_clusters[i]:
                        data_row['clust_'+str(i)] += 1
        else:
            # Games without any attributes are counted in the extra clust_24 column
            data_row['clust_24'] += 1
        game_features.append(data_row)

    game_features_df = pd.DataFrame(game_features)
    game_features_df = game_features_df.set_index('id')

    Step 2: Reconstructing the user features table

    Using the game features table we built in step 1, we will rebuild a much smaller user features table.

    User Features Table

    Similar to the game features table, each column in the user features table indicates the degree to which each user prefers games with that attribute. The numbers are computed for each user by retrieving the features for each game played by the user from the game features table and adding them up.

    import ast

    game_feat_dict = game_features_df.to_dict()

    # Read user items data file and build features table
    with open('/content/drive/MyDrive/VideoGameRecFiles/australian_users_items.json','r',encoding='utf8') as f:
        data = f.read()
    data = data.strip().split("\n")

    user_features = []
    # We need to keep track of all the games each user played so we can avoid
    # recommending games that they have already played.
    user_play_list = {}
    for user_data in data:
        # The dataset is not a properly formatted JSON file. Because of this we need to
        # iterate through each individual JSON object and use the ast module to parse it.
        record = ast.literal_eval(user_data)
        data_row = {k:0 for k in feature_columns}
        data_row['user_id'] = record['user_id']
        play_list = []
        for item in record['items']:
            item_id = item['item_id']
            play_list.append(item_id)
            for col in feature_columns:
                if item_id in game_feat_dict[col]:
                    data_row[col] += game_feat_dict[col][item_id]
        user_play_list[record['user_id']] = play_list
        user_features.append(data_row)

    user_features_df = pd.DataFrame(user_features)
    user_features_df = user_features_df.set_index("user_id").drop_duplicates()

    Step 3: Defining a new recommender function

    With our new game and user feature tables in place a new method for examining similarity between users is in order. In our first recommender, we used the matching dissimilarity score. In our new recommender, we’re going to use cosine similarity. Cosine similarity is a measure of similarity between two numerical vectors. It is the dot product between two vectors divided by the product of their lengths.
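
    To make the definition concrete, here is a minimal pure-Python sketch of cosine similarity. The function name cosine_sim and the three toy vectors are made up for illustration; the recommender itself relies on scikit-learn’s implementation.

```python
import math


def cosine_sim(a, b):
    """Dot product of a and b divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up user feature vectors (counts per attribute cluster)
u = [3, 0, 1]
v = [6, 0, 2]  # same direction as u, twice the magnitude
w = [0, 4, 0]  # shares no clusters with u

print(cosine_sim(u, v))  # close to 1: identical preferences, different volume
print(cosine_sim(u, w))  # 0: nothing in common
```

    Because only the direction of the vectors matters, a user who has played twice as many games with the same mix of attributes still counts as a near-perfect match. Note that this sketch does not guard against all-zero vectors, which would cause a division by zero.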

    Let’s suppose that we have a user that we want to generate recommendations for. We’ll call the user in question u and the number of recommendations we’d like to generate x. We will use cosine similarity to find another user, v, whose preferences are the most similar to u’s. We will then select x games from v’s play history that have not been played by u and recommend them. If v does not have enough unplayed games, we move on to the next most similar user.

    Here’s the new recommendation procedure in code form.

    from sklearn.metrics.pairwise import cosine_similarity

    def cosine_score(user1, user2):
        score = cosine_similarity(user1.values.reshape(1, -1), user2.values.reshape(1,-1))[0][0]
        return score

    def recommend_games(user_id, n=10):
        '''
        Given a user id, recommend games to that user. By default 10 games are recommended
        '''
        # Get user features
        user = user_features_df.loc[user_id]
        # Get games played by the user
        play_list = user_play_list[user_id]

        # Rank all other users from most to least similar
        other_users = user_features_df[user_features_df.index != user_id]
        scores = other_users.apply(lambda user2: cosine_score(user, user2), axis=1).sort_values(ascending=False)

        # Start with the most similar user's games that this user has not played
        rec_idx = 0
        recommended_games = user_play_list[scores.index[rec_idx]]
        recommended_games = list(filter((lambda gid: gid not in play_list), recommended_games))

        # Keep pulling from the next most similar users until we have n games
        while (len(recommended_games) < n):
            rec_idx += 1
            additional_games = user_play_list[scores.index[rec_idx]]
            recommended_games.extend(list(filter((lambda gid: gid not in play_list), additional_games)))
        return recommended_games[:n]

    That’s all folks!

    You can find the code for this post here. Until next time!

  • Clustering Video Game Attributes

    Clustering Video Game Attributes

    In a previous blog post, I walked through the creation of a simple recommender system that recommends video games to existing users on Steam. Since creating the recommender, my student and I have been exploring ways to improve it. One enhancement we’ve been looking at is speeding up the computational time of the recommender by clustering video game attributes (tags, genres, and specs) into smaller, more manageable groups. In this post I will describe how we utilized the K-means algorithm to do this.

    Improving the Recommender by Computing Probabilities

    The current system uses 188 categorical attributes to recommend video games to existing users. The biggest disadvantage of this approach is the large amount of computational time required to build the tables the recommender requires. The numerous categorical features also make adding numerical features (such as game price) to the system challenging; the influence of the categorical features will outweigh any impact the numerical features could have on the recommendations. To correct both issues, we will attempt to reduce the number of attributes the recommender uses by applying K-means clustering.

    The raw, unprocessed video game metadata contains 380+ attributes. We observed from looking at the data for several video games that some attributes tend to appear together with other attributes. This gave me the idea to try to group these attributes based on the likelihood that they appear together. We will do this by constructing a square matrix in which each entry is the conditional probability of observing one attribute in a video game given that another attribute is present. We use this matrix to cluster the game attributes.

    The code snippet below shows the first step in constructing the probability matrix. We iterate through each game in the Steam games dataset and construct a dictionary containing the sets of games that have each attribute.

    # Create a dictionary containing sets of games that have each attribute
    category_sets = {}
    for idx in range(steam_games_df.shape[0]):
        game_genre = steam_games_df.iloc[idx]['genres']
        game_tags = steam_games_df.iloc[idx]['tags']
        game_specs = steam_games_df.iloc[idx]['specs']
        game_id = steam_games_df.iloc[idx]['id']
        if game_genre:
            cat_genres = game_genre.split(",")
            for g in cat_genres:
                if g in category_sets:
                    category_sets[g].add(game_id)
                else:
                    category_sets[g] = set([game_id])
        if game_tags:
            cat_tags = game_tags.split(",")
            for t in cat_tags:
                if t in category_sets:
                    category_sets[t].add(game_id)
                else:
                    category_sets[t] = set([game_id])
        if game_specs:
            cat_specs = game_specs.split(",")
            for s in cat_specs:
                if s in category_sets:
                    category_sets[s].add(game_id)
                else:
                    category_sets[s] = set([game_id])

    We use the attributes dictionary to build the probability matrix.

    import numpy as np

    game_categories = list(category_sets.keys())
    probability_matrix = []
    for g in game_categories:
        prob_list = []
        for c in game_categories:
            # Fraction of the games with attribute g that also have attribute c
            game_intersection = category_sets[g].intersection(category_sets[c])
            prob_list.append(len(game_intersection) / len(category_sets[g]))
        probability_matrix.append(prob_list)
    probability_matrix = np.array(probability_matrix)
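
    To get a feel for what the matrix contains, here is the same computation on a tiny made-up example. The attribute names and game ids in toy_sets are assumptions for illustration; the structure mirrors category_sets.

```python
# Made-up attribute -> set-of-game-ids mapping (same shape as category_sets)
toy_sets = {
    'Action': {1, 2, 3, 4},
    'Shooter': {1, 2},
    'Puzzle': {5},
}

names = list(toy_sets)

# Row g, column c holds P(c | g) = |games with both g and c| / |games with g|
matrix = [
    [len(toy_sets[g] & toy_sets[c]) / len(toy_sets[g]) for c in names]
    for g in names
]

for name, row in zip(names, matrix):
    print(name, row)
```

    Note that the matrix is not symmetric: every Shooter game here is also an Action game, but only half of the Action games are Shooters.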

    Applying Dimensionality Reduction

    The probability matrix computed in the last section has 381 dimensions. As discussed in my high dimensional clustering post, clustering data with very high dimensions could be problematic. To avoid these problems, we’re gonna apply dimensionality reduction.

    The code snippet below uses the PCA class provided by scikit-learn to perform Principal Component Analysis on the probability matrix. To determine the number of principal components to keep, we computed a cumulative sum of the explained variance ratios for each principal component. Check out my article on PCA for more details on how it works.

    from sklearn.decomposition import PCA

    pca = PCA()
    pca.fit(probability_matrix)
    
    total = 0
    for idx, r in enumerate(pca.explained_variance_ratio_):
      total += r
      print("{0} Components: {1}".format(idx, total))

    We decided to keep just enough components to explain 70% of the variability in the data. That number happened to be 31. We will refit PCA to the dataset and reduce the dimensions.
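
    Rather than reading the cut-off from the printed list, the component count can also be computed directly. A minimal sketch with NumPy, using made-up variance ratios in place of pca.explained_variance_ratio_ (the values and the variable names are assumptions for illustration):

```python
import numpy as np

# Made-up explained variance ratios, sorted from largest to smallest
ratios = np.array([0.40, 0.20, 0.15, 0.10, 0.08, 0.07])

cumulative = np.cumsum(ratios)

# Index of the first component where the running total reaches the target,
# plus one to turn the zero-based index into a component count
target = 0.70
n_components = int(np.searchsorted(cumulative, target) + 1)

print(cumulative)
print(n_components)
```

    scikit-learn’s PCA can also do this selection itself: passing a fraction such as n_components=0.70 together with svd_solver='full' keeps just enough components to explain that share of the variance.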

    # Refit PCA to the probability matrix and keep only the 31 principal components
    pca = PCA(n_components=31)
    pca.fit(probability_matrix)
    feature_set = pca.transform(probability_matrix)

    With the newly featurized dataset in place, we can now proceed with the data clustering.

    Discovering Game Attribute Categories

    We use the silhouette method to find the optimal number of clusters for k-means. The code snippet below uses the scikit-learn implementations of k-means and the silhouette score to derive scores for different numbers of clusters. Check out this post for details on how the silhouette method works.

    from sklearn.cluster import KMeans
    from sklearn.metrics import silhouette_score

    range_n_clusters = range(2, 28)
    
    for n_clusters in range_n_clusters:
        clusterer = KMeans(n_clusters=n_clusters, random_state=25)
        cluster_labels = clusterer.fit_predict(feature_set)
    
        silhouette_avg = silhouette_score(feature_set, cluster_labels)
        print(
            "For n_clusters =",
            n_clusters,
            "The average silhouette_score is :",
            silhouette_avg,
        )

    Using the snippet above we determined the optimal number of clusters to be 24. Last, but not least, we group the game attributes into 24 clusters and output the group assignments.

    import pandas as pd

    # Fit k-means and get the cluster assignment for each attribute
    km = KMeans(n_clusters=24, random_state=25)
    km.fit(feature_set)
    labels = km.predict(feature_set)
    attribute_assignments = pd.Series(labels, index=game_categories)

    # Print the attributes that belong to each cluster
    for i in range(24):
        cluster = attribute_assignments[attribute_assignments == i]
        print("Cluster {0}: {1}".format(i, ",".join(cluster.index)))

    Cluster 0: Moddable,Trading,City Builder,Building,Economy,Base Building,Sandbox,Management,Space,Political,Agriculture,Space Sim,Capitalism,Politics,Resource Management,God Game,Fishing,Mining
    Cluster 1: Action,Indie,Simulation,Strategy,Single-player,RPG,Multi-player,Online Multi-Player,Cross-Platform Multiplayer,Steam Achievements,Steam Trading Cards,Stats,Adventure,Full controller support,Downloadable Content,Steam Cloud,Steam Leaderboards,Partial Controller Support,Early Access,Shared/Split Screen,Valve Anti-Cheat enabled,Steam Turn Notifications,Co-op,Violent,Commentary available,Steam Workshop,Includes level editor,Western,Flight,Tower Defense,Game demo,On-Rails Shooter,Soundtrack,Pinball
    Cluster 2: 2D,Replay Value,Difficult,Pixel Graphics,Cute,Singleplayer,Great Soundtrack,Retro,Platformer,Side Scroller,Stylized,Arcade,Underground,Remake,Action-Adventure,Spectacle fighter,Character Action Game,Beat 'em up,Controller,Fast-Paced,2.5D,Ninja,Puzzle-Platformer,Time Attack,Colorful,3D Platformer,Psychedelic,Score Attack,1980s,Time Manipulation,Cartoon,Metroidvania,Blood,Runner,Cartoony,GameMaker
    Cluster 3: FPS,Shooter,Third-Person Shooter,Sniper,Third Person,Survival,Classic,Gore,Sci-fi,Aliens,First-Person,Stealth,Assassin,Hunting,Futuristic,Cyberpunk,Destruction,Mechs,Robots,Lara Croft,Dinosaurs,Parkour,3D Vision,Zombies,Survival Horror,Bullet Time,Arena Shooter,Post-apocalyptic,Inventory Management,Star Wars,6DOF,Heist,Transhumanism,Gun Customization,Mars
    Cluster 4: Mod,Mods,Mods (require HL2),Mods (require HL1)
    Cluster 5: Design & Illustration,Tutorial,Education,Animation & Modeling,Animation &amp; Modeling,Video Production,Utilities,Web Publishing,Game Development,Software Training,Design &amp; Illustration,Audio Production,Photo Editing,Accounting
    Cluster 6: HTC Vive,Oculus Rift,Tracked Motion Controllers,Room-Scale,VR,Seated,Standing,SteamVR Collectibles,Keyboard / Mouse,Gamepad,Windows Mixed Reality,360 Video
    Cluster 7: Character Customization,Open World,Crafting,Swordplay,Hack and Slash,Action RPG,Medieval,Pirates,Dragons,Voxel,Sailing
    Cluster 8: Card Game,Trading Card Game,Turn-Based,Board Game,Turn-Based Strategy,4X,Turn-Based Tactics,Warhammer 40K,Games Workshop,Hex Grid,Tactical RPG,Turn-Based Combat,Strategy RPG,Asynchronous Multiplayer,Chess
    Cluster 9: Female Protagonist,Nudity,Anime,Choices Matter,Multiple Endings,Romance,Visual Novel,Sexual Content,Interactive Fiction,Dating Sim,RPGMaker,Choose Your Own Adventure,Text-Based,Otome
    Cluster 10: Tactical,War,Rome,Historical,Wargame,Cold War,Real-Time with Pause,RTS,Diplomacy,World War II,Alternate History,Real-Time,Grand Strategy,Real Time Tactics,Military,Naval,Tanks,America,World War I,Modern
    Cluster 11: 1990's,Story Rich,Atmospheric,Silent Protagonist,Linear,Mystery,Experience,Psychological Horror,Horror,Exploration,Point & Click,Underwater,Lovecraftian,Demons,Detective,Supernatural,Steampunk,Dystopian,Dark,Mature,Noir,Cinematic,FMV,Cult Classic,Based On A Novel,Surreal,Short,Walking Simulator,Psychological,Time Travel,Hand-drawn,Experimental,Quick-Time Events,Conspiracy,Narration,Dynamic Narration,Lore-Rich,Conversation,Nonlinear,Philisophical,Mystery Dungeon
    Cluster 12: Realistic,Driving,Trains,TrackIR
    Cluster 13: Captions available,Episodic,Crime,Benchmark,Movie,Thriller,Werewolves,Documentary,Martial Arts,Drama,Gaming,Foreign,Feature Film,Hardware,Faith
    Cluster 14: Fantasy,Dark Fantasy,Gothic,Isometric,Vampire,Magic,Mythology,Villain Protagonist,CRPG,Dungeon Crawler,JRPG,Kickstarter,Investigation,Crowdfunded,Grid-Based Movement,Voice Control
    Cluster 15: Casual,Physics,Science,Clicker,Puzzle,Music,Hidden Object,Match 3,Touch-Friendly,Family Friendly,Level Editor,Abstract,Relaxing,Mouse only,Music-Based Procedural Generation,Rhythm,Minimalist,Hacking,Lemmings,Sokoban,Typing,Programming,Artificial Intelligence,Word Game,Spelling,Steam Machine
    Cluster 16: LEGO,Batman,Superhero,Comic Book
    Cluster 17: Party-Based RPG,Software
    Cluster 18: Sports,Racing,Golf,Horses,Offroad,Bowling,Mini Golf,Football,Soccer,Gambling,Basketball,Cycling,Pool,Wrestling
    Cluster 19: Free to Play,PvP,Competitive,In-App Purchases,Multiplayer,Massively Multiplayer,MMO,Online Co-op,MMORPG,Online Co-Op,Team-Based,Includes Source SDK,Class-Based,PvE,MOBA,e-sports
    Cluster 20: Local Co-op,Local Multi-Player,Co-op Campaign,Local Co-Op,Local Multiplayer,Fighting,Split Screen,2D Fighter,4 Player Local
    Cluster 21: Top-Down,Top-Down Shooter,Loot,Shoot 'Em Up,Twin Stick Shooter,Bullet Hell,Rogue-like,Procedural Generation,Rogue-lite,Perma Death
    Cluster 22: Funny,Comedy,Satire,Dark Humor,Memes,Parody,Illuminati,Intentionally Awkward Controls,NSFW,Dark Comedy
    Cluster 23: Bikes

    K-means does a pretty good job of grouping attributes into meaningful clusters. One downside is that the cluster assignments can differ from one run to the next. Because the initial centroids used by k-means are chosen at random, running it on the same dataset can produce different clusters unless the random seed is fixed, as random_state does in the snippets above. We can work around this issue by performing the clustering several times and choosing the result with the lowest sum of squared errors.
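
    The workaround described above can be sketched as follows: run k-means several times with different seeds and keep the fit with the lowest sum of squared errors, which scikit-learn exposes as inertia_. The data, the seed range, and the cluster count are illustrative assumptions. Each fit uses n_init=1 so that it is a single run from one random initialization.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical stand-in for the reduced feature set.
rng = np.random.default_rng(0)
demo_features = rng.normal(size=(150, 4))

best_model = None
inertias = []
for seed in range(10):
    # One k-means run per seed; inertia_ is the sum of squared distances
    # from each sample to its closest centroid.
    model = KMeans(n_clusters=5, n_init=1, random_state=seed).fit(demo_features)
    inertias.append(model.inertia_)
    if best_model is None or model.inertia_ < best_model.inertia_:
        best_model = model

print("Lowest sum of squared errors:", best_model.inertia_)
best_labels = best_model.labels_
```

    KMeans can also do this internally: its n_init parameter sets how many initializations are tried, and the run with the lowest inertia is the one that is kept.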

    That’s all folks!

    In my next post, we will use the attribute clusters to make a new version of the recommender. We will then explore methods for evaluating how well our recommenders recommend video games to users. You can find the code for the entire solution here.