Tag: segmentation

  • Calculating Lifetime Value with Cohorts

    Calculating Lifetime Value with Cohorts

    Customer lifetime value is a metric that can be used to gauge the health of a business. In my last blog post on the topic, I showed python code that estimates customer lifetime value by computing a simple average. Today I will show you another approach that is an improvement over the one described previously.

    Introducing Cohort Aggregates

    This method for estimating customer lifetime value computes multiple metrics, one for each cohort of customers in the dataset. Not only does this provides a much better estimate over a simple average, but it also allows us to study the behavior of customers in response to different marketing campaigns. We will carry out the calculation using the same <link to dataset> dataset used in the last post.

    Let’s begin the same way as before by loading the dataset

    with open('CDNOW_master.txt') as f:
        dataset = f.read().split("\n")
    
    records = []
    for line in dataset:
        if line == '':
            continue
        row = list(filter(lambda token: token != '', line.split(' ')))
        rec = {}
        rec['customerID'] = row[0]
        rec['purchaseDate'] = datetime.strptime(row[1], '%Y%m%d')
        rec['quantity'] = int(row[2])
        rec['price'] = float(row[3])
        records.append(rec)
    transactions_df = pd.DataFrame(records)

    Next we’ll create a new column in our dataframe that contains total dollar value for each transaction. This column will be used later when we calculate customer lifetime value for each cohort.

    transactions_df['total'] = transactions_df['price'] * transactions_df['quantity']

    Speaking of cohorts, allow me to explain what I mean by this. For the same of this analysis I’ll define a cohort as a group of customers who made their first transaction in the same month. In order to perform this grouping, we first need to figure out what month each person became a customer. The following code snippet will create a dataframe containing the information we need.

    first_transactions_df = transactions_df.groupby('customerID')['purchaseDate'].min().reset_index()
    
    first_transactions_df['firstTransactionMonth'] = first_transactions_df['purchaseDate'].dt.month
    
    first_transactions_df['firstTransactionMonth'].value_counts()
    Number of newly acquired customers by month

    Amazing! Now we can merge this dataframe into our transactions dataset so everything is in one place.

    
    transactions_df = pd.merge(transactions_df, first_transactions_df[['customerID','firstTransactionMonth']], on='customerID')
    Customer Transactions Dataframe

    We’re now ready to calculate customer lifetime value for each cohort. To do this we will define a function that computes the customer lifetime value for a subset of customers.

    def calculate_cltv(df):
        transactions_per_customer = df.groupby('customerID')['purchaseDate'].count()
        avg_frequency = transactions_per_customer.mean()
        
        minmax_purchase_dates_by_customer = df.groupby('customerID')['purchaseDate'].agg(['min','max'])
        
        customer_lifetimes = minmax_purchase_dates_by_customer.apply(lambda row: (row['max'] - row['min']).days, axis=1)
        avg_lifetime = customer_lifetimes.mean()
        
        avg_order_value = transactions_df['total'].mean()
        
        return avg_frequency * avg_lifetime * avg_order_value

    All that’s left to do now is to aggregate our transactions by the firstTransactionMonth and apply the function we defined earlier to each group.

    transactions_df.groupby('firstTransactionMonth').apply(calculate_cltv)
    Customer Lifetime Value by Cohort

    Lo and behold we are done!

    That’s all folks!

    You can find the complete code for this post here. Until next time!

  • Simple yet effective RFM

    Simple yet effective RFM

    Looking for a quick and dirty way to segment your customer data that does not require machine learning? In this post we will take a look at a python implementation of a popular technique used by marketers to divide customers into buckets of high, medium, and low value customers. The technique is called RFM.

    What is RFM?

    RFM quantitively ranks and groups customers based on three factors:

    How recent was the customer’s last purchase. The more recently a customer made a purchase, the more likely that customer will make a purchase again as he/she would likely still have the product on their mind.

    How frequently did the customer purchase products. Frequent patrons are likely to purchase again in the future and are therefore valuable to the business.

    The amount of money the customer spent. Customers who spent a lot of money during a particular period of time are very valuable to the business, as they may make another purchase in the future.

    As you may have guessed RFM is short for recency, frequency, and monetary. Here’s how it works:

    1. First, rate each customer on each of the three factors. Generally a scale of 1 to 5 is used, with 5 being the highest possible score. However, you are free to choose a different scaling that makes better sense for your problem.
    2. Next step, average each score together to compute an RFM score. The higher the score is for a customer the more valuable the customer is. Alternatively, you can compute a weighted average of the three scores if certain factors in your problem are more important than others.
    3. Lastly, group each customer into high, medium, and low value buckets based on the RFM score.

    RFM in action

    Now that you’re acquainted with RFM, we’ll apply it to some customer data so you can see it in action. The dataset we’ll use is from an UK based online retail store.

    We’ll start by reading the provided CSV file and inspecting the data attributes.

    transactions_df = pd.read_csv('data.csv',encoding='ISO-8859-1')
    transactions_df.info()

    Here’s what you’ll see as output:

    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 541909 entries, 0 to 541908
    Data columns (total 8 columns):
     #   Column       Non-Null Count   Dtype  
    ---  ------       --------------   -----  
     0   InvoiceNo    541909 non-null  object 
     1   StockCode    541909 non-null  object 
     2   Description  540455 non-null  object 
     3   Quantity     541909 non-null  int64  
     4   InvoiceDate  541909 non-null  object 
     5   UnitPrice    541909 non-null  float64
     6   CustomerID   406829 non-null  float64
     7   Country      541909 non-null  object 
    dtypes: float64(2), int64(1), object(5)
    memory usage: 33.1+ MB

    For this exercise, we will be focusing on the InvoiceNo, Quantity, InvoiceDate, UnitPrice, and CustomerID attributes.

    The CustomerID attribute uniquely identifies an online store customer.

    The InvoiceNo, and InvoiceDate attributes uniquely identifies a transaction performed by a customer and the date it was performed.

    A transaction in this dataset is composed of one or more items purchased at various quantities. The Quantity and UnitPrice attributes contains the amount of an item purchased and the cost per item.

    Looking at the output above, it appears that the CustomerID attribute contains missing values. We will proceed by discarding the rows with the missing values from the dataset.

    transactions_df = transactions_df[transactions_df['CustomerID'].notnull()].copy()

    Next, we will convert the InvoiceDate attribute into a datetime attribute to make it easier to work with.

    transactions_df['InvoiceDate'] = pd.to_datetime(transactions_df['InvoiceDate'])

    Looking at the transaction data, it looks like there’s some transactions with negative quantities. These are likely refunds made to the customer. We will discard these transactions from our analysis as well.

    transactions_df = transactions_df[transactions_df['Quantity'] > 0]

    Next up, we will create three new pandas dataframes containing computed recency, frequency, and monetary values for each customer.

    We will make use of the newly converted InvoiceDate attribute to obtain the most recent transaction for each customer and compute the number of days since the latest transaction in the entire dataset.

    # Get the most recent transaction in the entire dataset
    most_recent_transaction = transactions_df['InvoiceDate'].max()
    
    # Compute the lastest transaction for each user
    latest_transactions_per_user = transactions_df.groupby('CustomerID')['InvoiceDate'].max()
    
    recency_df = latest_transactions_per_user.reset_index()
    recency_df['recency'] = recency_df['InvoiceDate'].apply(lambda date: (most_recent_transaction - date).days)
    
    # Inspect the results
    recency_df.head()
    recency values per customer

    Next we will create a dataframe containing the number of reqeat transactions made by each customer. The code snippet below accomplishes this by computing the number of unique invoices produced for each customer.

    # Calculate the number of invididual invoices generated for each customer
    num_transactions_per_user = transactions_df.groupby('CustomerID')['InvoiceNo'].unique().apply(lambda lst: len(lst))
    
    frequency_df = num_transactions_per_user.reset_index().rename(columns={'InvoiceNo':'frequency'})
    
    # Inspect the results
    frequency_df
    frequency values per customer

    Finally we will calculate the total spend for each customer.

    # Create a new column containing the total amount spent for each item
    transactions_df['Total'] = transactions_df['Quantity'] * transactions_df['UnitPrice']
    
    # Now, aggregate total spend by customer
    monetary_df = transactions_df.groupby('CustomerID')['Total'].sum().reset_index().rename(columns={'Total':'monetary'})
    
    monetary_df
    monetary value per customer

    Let’s merge the recency, frequency, and monetary dataframes into one dataframe.

    rfm_df = pd.merge(pd.merge(recency_df, frequency_df, on='CustomerID'), monetary_df, on='CustomerID').drop('InvoiceDate',axis=1)
    combined recency, frequency, and monetary values

    We’ll now rank the recency, frequency, and monetary values on a scale of 1 to 5. We will use the pandas cut function can be used to bin each value into 5 evenly spaced intervals. You can also specify your own binning intervals you’d like cut to use as well.

    rfm_df['recency_score'] = pd.cut(rfm_df['recency'], bins=5, labels=[5,4,3,2,1]).astype(int)
    rfm_df['frequency_score'] = pd.cut(rfm_df['frequency'], bins=5, labels=[1,2,3,4,5]).astype(int)
    rfm_df['monetary_score'] = pd.cut(rfm_df['monetary'], bins=5, labels=[1,2,3,4,5]).astype(int)
    rankings per customer

    Using the recency, frequency, and monetary ratings we will create an RFM score. We will calculate scores for each user by computing a weighed average of the three rankings.

    rfm_df['rfm_score'] = (0.35*rfm_df['recency_score'] + 0.35*rfm_df['frequency_score'] + 0.40*rfm_df['monetary_score'])
    rfm_df['rfm_score'] = rfm_df['rfm_score'].round(2)
    RFM scores per user

    Now that the RFM scores are computed for each customer, let’s form three segments of high, medium, and low value customers.

    • We’ll define high tier customers as customers with RFM scores greater than 3
    • Medium tier customers will be defined as customers with RFM scores between 2 and 3
    • Customers with RFM scores less than 2 will be placed in the low tier bucket.
    def segment_customer(score):
        if score >= 3:
            return 'High'
        elif score >= 2 and score < 3:
            return 'Medium'
        else:
            return 'Low'
    
    rfm_df['segment'] = rfm_df['rfm_score'].apply(segment_customer)
    Customer segments

    We got our segments! We can aggregate our customers by the segments by see what what it looks like.

    rfm_df.groupby('segment')['CustomerID'].count()

    The output…

    segment
    High        16
    Low       1041
    Medium    3282
    Name: CustomerID, dtype: int64

    That’s all folks!

    RFM is a simple and quick way to segment your customers into buckets that you can quickly take action on. If you’re looking for segmentation techniques that utilize additional customer attributes, check out some of the other methods I discuss on the blog such as K-means. You can find find the entire code for this post here. Until next time!

  • Improving the Video Game Recommender

    Improving the Video Game Recommender

    In my last post, we took the 380+ game video game attributes that we extracted from the Steam video game dataset and wrote an algorithm to cluster the attributes into 24 groups. In this post we will use the clusters to make an new and improved video game recommender. If you haven’t read my first
    and second post on the video game recommender, please read them before continuing.

    Step 1: Reconstructing the game features table

    The game features table that we built in the first post contained hundreds of attributes. We will construct a much smaller table using the game attribute clusters.

    Game Features Table

    The each column in the table indicates the magnitude of that attribute present in each game. The numbers are computed for each game using the attribute-cluster assignments we obtained from K-means. For each game, we count up the number of attributes that belong to each cluster. The construction of the new game features table is described in the code snippet below.

    # Fit the K Means clustering algorithm get the cluster assignments for each attribute
    km = KMeans(n_clusters=24, random_state=25)
    km.fit(feature_set)
    labels = km.predict(feature_set)
    attribute_assignments = pd.Series(labels, index=game_categories)
    Group attributes into list of clusters
    attribute_clusters = []
    for i in range(24):
    cluster = attribute_assignments[attribute_assignments == i]
    attribute_clusters.append(cluster.index.tolist())
    feature_columns = ['clust_'+str(i) for i in range(25)]
    game_features = []
    for idx in range(steam_games_df.shape[0]):
    # Obtain list of genres, tags, and specs
    game_genre = steam_games_df.iloc[idx]['genres']
    game_tags = steam_games_df.iloc[idx]['tags']
    game_specs = steam_games_df.iloc[idx]['specs']
    attributes = []
    data_row = {k:0 for k in feature_columns}
    data_row['id'] = steam_games_df.iloc[idx]['id']
    # Iterate through each entry in the lists and create the features
    if game_genre:
    attributes.extend(game_genre.split(','))
    if game_tags:
    attributes.extend(game_tags.split(','))
    if game_specs:
    attributes.extend(game_specs.split(','))
    attributes = set(attributes)
    if len(attributes) > 0:
    for attr in attributes:
    for i in range(len(attribute_clusters)):
    if attr in attribute_clusters[i]:
    data_row['clust_'+str(i)] += 1
    else:
    data_row['clust_24'] += 1
    game_features.append(data_row)
    game_features_df = pd.DataFrame(game_features)
    game_features_df = game_features_df.set_index('id')

    Step 2: Reconstructing the user features table

    Using the game features table we built in step 1, we will rebuild a much smaller user features table.

    User Features Table

    Similar to the game features table, each column in the user features table indicates the degree each user prefers a game with that attribute. The numbers are computed for each user by retrieving the features for each game played by the user from the game features table and adding them up.

    game_feat_dict = game_features_df.to_dict()
    # Read user items data file and build features table
    with open('/content/drive/MyDrive/VideoGameRecFiles/australian_users_items.json','r',encoding='utf8') as f:
    data = f.read()
    data = data.strip().split("\n")
    user_features = []
    # We need to keep track of all the games each user played so we can avoid recommending games that they have already played.
    user_play_list = {}
    for user_data in data:
    # The stdataset is not a properly formatted json file. Because of this we need to iterate through each individual JSON object and use
    # the ast module to parse the object.
    record = ast.literal_eval(user_data)
    data_row = {k:0 for k in feature_columns}
    data_row['user_id'] = record['user_id']
    play_list = []
    for item in record['items']:
    item_id = item['item_id']
    play_list.append(item_id)
    for col in feature_columns:
    if item_id in game_feat_dict[col]:
    data_row[col] += game_feat_dict[col][item_id]
    user_play_list[record['user_id']] = play_list
    user_features.append(data_row)
    user_features_df = pd.DataFrame(user_features)
    user_features_df = user_features_df.set_index("user_id").drop_duplicates()

    Step 3: Defining a new recommender function

    With our new game and user feature tables in place a new method for examining similarity between users is in order. In our first recommender, we used the matching dissimilarity score. In our new recommender, we’re going to use cosine similarity. Cosine similarity is a measure of similarity between two numerical vectors. It is the dot product between two vectors divided by the product of their lengths.

    Let’s suppose that we have a user that we want to generate recommendations for. We’ll call the user in question u and the number of recommendations we’d like to generate x. We will use cosine similarity to find another user named v who’s preference is the most similar to u. We will then select x games from v‘s play history that has not been played by u and then recommend them

    Here’s the new recommendation procedure in code form.

    def cosine_score(user1, user2):
    score = cosine_similarity(user1.values.reshape(1, -1), user2.values.reshape(1,-1))[0][0]
    return score
    def recommend_games(user_id, n=10):
    '''
    Given a user id, recommend games to that user. By default 10 games are recommended
    '''
    # Get user features
    user = user_features_df.loc[user_id]
    # Get games played by the user
    play_list = user_play_list[user_id]
    other_users = user_features_df[user_features_df.index != user_id]
    scores = other_users.apply(lambda user2: cosine_score(user, user2), axis=1).sort_values(ascending=False)
    rec_idx = 0
    recommended_games = user_play_list[scores.index[rec_idx]]
    recommended_games = list(filter((lambda gid: gid not in play_list), recommended_games))
    while (len(recommended_games) < n):
    rec_idx += 1
    additional_games = user_play_list[scores.index[rec_idx]]
    recommended_games.extend(list(filter((lambda gid: gid not in play_list), additional_games)))
    return recommended_games[:n]

    That’s all folks!

    You can find the code for this post here. Until next time!

  • High Dimensional Clustering 101

    High Dimensional Clustering 101

    High dimensional data are datasets containing a large number of attributes, usually more than a dozen. There are a few things you should be aware of when clustering datasets such as these. The goal of this post is to highlight a few strategies you can use when performing high dimensional clustering.

    Setting the Stage

    To set the stage for the discussion take a look at the image below.

    Which one of the red points would you say is the closest to the x in the center? As you can see, there’s no clear answer. Some of you reading this may even say that is a nonsensical question to be asking, because these points are roughly equidistant from x. Even if there is a single point that is closer than the others, it would only be by a measly, negligible amount.

    When a data clustering algorithm looks at high dimensional data, the data points could look much like the above image. The results we get back from the algorithms may not capture the patterns present in the dataset. This is known in machine learning as the curse of dimensionality.

    Dum Dum Duuuuuuuuumm… The curse of dimensionality

    The curse of dimensionality is a very interesting, and counterintuitive phenomenon that arises when machine learning algorithms are applied to datasets with a large number of features.

    Data clustering algorithms work by computing distances between data points and grouping together points that are close together in proximity. When the number of features in a dataset is small, the algorithms are able to clearly the data points that are close together from the ones that are not. For datasets with many features, the task of identifying points that are closer together is more challenging. The reason for this is because the distance between points in high dimensions are far away. This is true even for points that are similar to each other.

    Pictured below are two charts. The chart on the left is a 1-dimensional plot of several data points. The chart on the right is a histogram of pairwise distances between each data point pictured in the first plot.

    As shown in the plot on the left, the dataset can be neatly grouped into three clusters. You can also see this from the three peaks in the histogram on the right.

    Here’s what the charts look like when we add a second dimension to the data.

    With the addition of the second dimensions, the distances between each point is larger. Nevertheless a clustering algorithm can still easily cluster these points into 3 groups.

    The same observation can be made if a third dimension was added to the data.

    As more dimensions are added to the dataset the distances between the points become larger and the peaks we observed in the histogram become narrower. After 100,000 dimensions we’ll get a histogram shaped like the one below:

    Euclidean distance of datapoints on a dataset with 100,000 dimensions.

    Wait, isn’t there supposed to be three peaks? At 100,000 dimensions, the distance between the data points became so large that the third cluster we saw in the first 3 plots became indistinguishable from the other two. If we were to run K-means on this dataset, the clusters it would give us would definitely not be what we’d expect.

    Now that you are aware of the curse of dimensionality I will briefly highlight in the next few sessions some methods that you can use to work around it.

    Use Feature Selection

    You don’t need to cluster with every attribute in your dataset. If you’re only interested in a handful of attributes, you can cluster on just those attributes and discard the others. You should also examine your dataset for features that show very little variation or are highly correlated with other features. Those features will not provide any meaningful information to a clustering algorithm and can therefore be discarded.

    Use Dimensionality Reduction

    Another approach for clustering high dimensional data entails projecting the dataset to a lower dimensional space before applying a clustering algorithm. Oftentimes a dataset can be approximated very well using a smaller number of dimensions without losing much information. This can be accomplished by applying techniques from linear algebra such as Principle Component Analysis and Singular Value Decomposition. I will explore both topics more in detail in future posts.

    Use a Graph-based Clustering Algorithm

    Graph based clustering algorithms such as Spectral clustering uses a different criteria for forming clusters that is more robust in high dimensions than K-means and other related methods. I will explore spectral clustering more in detail in a future post.

    That’s all folks!

    In my next post, I explore the topic of dimensionality reduction further by discussing the nuts and bolts of Principle Component Analysis. Stay tuned.

  • Let’s Get Fuzzy with C Means Clustering

    Let’s Get Fuzzy with C Means Clustering

    Not every dataset you will encounter in the wild will have data points that can be neatly placed in a single cluster. You may come across data points that could potentially belong to two, three, or possibly more clusters.

    Consider this scenario. You work as a data analyst for a company that has developed a fitness accountability app. During the first week of January, the company experienced a large number of new app installs. These installs are presumably from people looking to fulfill their new year resolution fitness goals. At the end of the month you gathered a week’s worth of activity data from several of these new users. Below is a scatter plot of this data.

    Data points in red represent users who deleted the app after 1 week of usage. The data points in green are users who are still actively using the app. From the plot it is clear that there’s a relationship between level of app engagement and likelihood to churn. Users furthest to the left are more likely to delete the app after a week, while users furthest to the right are more likely to remain active on the app. A high degree of uncertainty, however, exists for users in the middle of the plot. These users could either churn or remain active. 

    Let’s now suppose you were given the task to group these users into a “churn” and “not churn” categories. You could use a clustering method like K-means to group the users, but information about the degree of uncertainty in the categories would be lost. A new clustering method is needed that would allow you to model the degree a data point is similar to or dissimilar from users that churn and remain active.

    Enter Fuzzy C Means Clustering

    C Means is a k-means variation that assigns each data point probabilities of belonging to each cluster. C Means differs from k-means in two key ways:

    The first difference is that c means constructs a table (i.e. matrix) that contains the probability that each datapoint belongs to a particular cluster. Here’s what the table would look like for a few data points in the problem outlined above:

    UserChurnNot Churn
    A0.250.75
    B0.910.09
    C0.510.49
    D0.170.83

    Each entry in the table has a number between 0 and 1. The larger the number, the more likely that data point belongs to the cluster. The sum of probabilities for each datapoint must also equal 1. If you look at the probabilities for each data point in the data table above, you will see that they indeed do.

    This table, which we will call the fuzzy partition table, is randomly initialized at the start of c-means algorithm and then repeatedly updated, along with the centroid. Each entry in the table is updated using the following equation:

    dist(xi,cj) and dist(xi,cq) represents the Euclidean distance between a datapoint xi and a cluster centroid cj and between xi and cq. You may have also noticed the m exponential term in the numerator and denominator. This is a special parameter that controls the fuzziness of the results. When m is 0, the equation will produce a binary output (1 or 0), akin to what you’d get if K-means was applied. As a matter of fact, you can think about K-means as C-means with the m parameter set to 0.  The larger m is the fuzzier the output is.

    The second key difference lies in the calculation of the centroid on each iteration of the algorithm. The formula is as follows

    If you read my previous post on weighted kmeans, you’ll notice that this is essentially the weighted mean.

    With that, here’s the steps of the C means algorithm:

    1. Choose c, the number of clusters, and m
    2. Randomly select c data points from the dataset to be initial centroids
    3. Randomly choose values for each entry in the fuzzy partition table, subject to the constraint that the values sum to 1.
    4. Compute new centroids
    5. Update the fuzzy partition table
    6. Repeat step 4 and 5 until either there’s no change in the values in the table or the difference is smaller than some threshold.

    Choosing the C in C-Means

    For C means you can use the Elbow method to find the optimal value for c. Below is a modified version of SSE to use when choosing the optimal value for c.

    That’s all folks!

    In my next post I will discuss methods for clustering high dimensional data.

  • Scaling your data before clustering

    Scaling your data before clustering

    You will often need to perform some form of preprocessing on your dataset before running a data clustering algorithm. In this post I will introduce one important preprocessing method that is often overlooked when clustering data. That method, my friends, is called data scaling.

    What is data scaling?

    As you may already know, clustering algorithms work by computing distances (i.e. dissimilarities) between data points in the dataset and grouping together points that are close in proximity. The method used for calculating the distance will be different depending on the algorithm used. One trait most of these methods share is high sensitivity to attributes (also called features) with numerically larger values compared to other attributes in a dataset. These features will bias the distance calculations and can cause the clustering algorithms to produce subpar clusterings.

    To illustrate this, let’s say you have a dataset containing SAT and GPA scores for 5000 college students that you’d like to segment into 4 groups.

    StudentSATGPA
    114191.35
    213841.16
    37843.58
    47483.31
    511240.98
    49966832.02
    49977393.58
    49987413.21
    499910881.01
    50005951.60

    After running K-means on the dataset, you create a scatter plot showing the student scores and the clusters each student is assigned to.

    It’s immediately apparent looking at the plot that a large number of students were mislabeled by the algorithm. SAT scores have a significantly larger numerical scale (400 to 1600) than that of grade point averages (0 to 4). SAT scores therefore have a much larger influence on the  distance measurement K-means uses to group the students.

    As can be seen from the plot, K-means leaned heavily on the SAT scores to cluster the data. To ensure that both GPA and SAT has an equal weight in the clustering both features need to be transformed so they are on the same scale. This is exactly what data scaling is for.

    Here’s the scatter plot we get after scaling the data set and performing k-means:

    Student clusters after applying data scaling

    Data Scaling Techniques

    There are two main methods for scaling data. The first method is called normalization and the second method is called standardization.

    Normalization

    Normalization uses the minimum and maximum to transform features onto the same scale. Below is the formula used to normalize each feature:

    Here’s what the student scores dataset would look like when each feature is normalized:

    StudentSATGPA
    10.8491670.3375
    20.8200000.2900
    30.3200000.8950
    40.2900000.8275
    50.6033330.2450
    49960.2358330.5050
    49970.2825000.8950
    49980.2841670.8025
    49990.5733330.2525
    50000.1625000.4000

    Normalization rescales the dataset features between 0 and 1 or -1 and 1. Normalization is typically used when the numerical distribution of the features is not known. Normalization fails however on datasets with outliers. Allow me to explain why.

    Suppose you have a small dataset containing the following points.

    XY
    01
    23
    53
    1013
    1517
    2030
    2223
    2419
    99025
    100027

    As you can see from the table, the last two data points are outliers. Using normalization, the data points will transform into the following

    XY
    0.0000.000
    0.0020.083
    0.0050.083
    0.0100.500
    0.0150.666
    0.0200.458
    0.0220.583
    0.0240.708
    0.9900.833
    1.0001.000

    Even though normalization shifts the values between 0 and 1, the outliers still remain as outliers. If we were to compare these points using a distance function, the [] feature would have more influence on the measurement.

    Standardization

    Standardization uses the mean and standard deviation to transform a dataset. The formula for transforming each feature is shown below:

    Here, µ is the mean of the feature and σ is the standard deviation of the feature. This is what the same student score dataset from earlier would look like when standardized.

    StudentSATGPA
    11.393781-0.447397
    21.283359-0.634040
    3-0.6095871.743211
    4-0.7231631.477980
    50.463083-0.810861
    4996-0.9282320.210768
    4997-0.7515571.743211
    4998-0.7452481.379747
    49990.349506-0.781391
    5000-1.205864-0.201813

    Standardization is very helpful in cases where the features in the dataset are normally distributed. But your data doesn’t necessarily have to be normally distributed in order to use standardization. Unlike normalization, standardization does not squeeze your data into a bounding range, meaning that it will still work if there’s outliers present.

    That’s all, folks!

    There are many other data scaling techniques that can be used, but normalization and standardization are the two methods that are mainly used in practice. In my next post I will discuss fuzzy clustering.

  • Weighing your k-means data clustering

    Weighing your k-means data clustering

    In my previous post on the K-Prototypes algorithm, I introduced the concept of parameter weighting. With just a single parameter, you can control the amount of influence categorical attributes has on the data clustering. In this post, I’m going to expand on this concept and show how we can control the weightings of entire data points. There are some data clustering problems you will encounter in the real world where this technique is essential.

    Customer Segmentation Case Study

    Consider the following scenario. You were hired by a growing online distributor of flavored milk to help them understand the shopping habits of their customers. The company would like to segment their customers into different groups. In particular, the company would like to understand the representative spend share for each customer group it can effectively firm up its marketing strategy. You are handed a dataset containing the percentage of sales in five different milk products over the past year.

    CustomerIDOrganic MilkCondensed MilkAlmond MilkCoconut MilkSkim Milk
    113.4526.0511.144.9927.44
    224.7319.5216.583.4923.65
    325.5020.4718.142.6623.49
    44.520.107.369.2713.47
    526.4721.0918.41.9021.75
    69.4315.3711.891.7825.81
    74.8321.115.489.4713.80
    85.7618.097.429.9111.25
    95.3618.274.849.9713.01
    107.9920.176.637.912.39

    Customer segmentation problems such as this one is an excellent use case for K-means clustering. The centroids (i.e. center points) produced by k-means can provide the representative customer characteristics that your client is looking for. Taking a look at the distribution of total spend for each customer, you notice that the numbers are heavily skewed.

    Some customers a high spend share in one or more categories, but their total spend is so low that they are less likely to respond to the company’s marketing campaign. Because K-means weighs each data point evenly, it will not take the customer total spend into account and will produce clusters and center points that are misleading.

    In order to get the output we’re looking for, we need to weigh each customer differently. Customers who have a higher total spend should have more influence on the cluster centroids than customers who do not. This is exactly what weighted k-means clustering allows us to do.

    About Weighted K-means clustering

    To perform weighted k-means clustering, we need to make a minor tweak to the way the cluster centroids are calculated after each iteration. Instead of using the mean the calculate the centroids, we use the weighed mean:

    Each data point xi in the dataset is assigned a weighted value wi. Computing the weighted mean for a cluster is just a matter of:

    1. Multiplying each data point by its weight
    2. Computing the sum the products calculated in step 1 together, and
    3. Dividing the sum calculated in step 2 by the sum of all the weights.

    As usual, here’s the weighted k-means clustering procedure, taking in account the new centroid calculation discussed above:

    1. Choose k. Determine the number of clusters to form
    2. Select initial centroids. In this step the algorithm chooses k data points from the dataset at random to be the initial centroids.
    3. Assign data points to a centroid.  In this step the algorithm calculates the Euclidean distance between each data point and each centroid. Data points are assigned to the centroid with the smallest distance.
    4. Calculate weighted centroids. New centroids are calculated by finding the weighted mean of each cluster
    5. Repeat steps 3 and 4. Repeat steps 3 and 4 until no changes in the assignments are made.

    Weighted K-means and Outliers

    Weighed k-means clustering is also useful when clustering datasets with outliers. By assigning small weight values to outlier data points, k-means will form clusters that are robust to these extreme values.

    That’s all folks!

    I hope you found this guide on weighted k-means useful. In my next post, I will discuss data scaling, an important but often overlooked concept that has a huge impact on the data clustering algorithms.

  • Mixed Data Clustering using K-Prototypes

    Mixed Data Clustering using K-Prototypes

    This blog has thus far covered a number of data clustering algorithms. A majority of the algorithms operate under the assumption that the dataset to be clustered is purely numerical. The k-Modes algorithm was the one method we’ve covered that works on purely categorial data. What if you want to cluster a dataset that contains both categorial and numeric data. That’s exactly what the k-prototypes algorithm allows you to do. 

    What is a Prototype?

    In K-Means, the centroids are found by calculating the mean of each cluster’s data points. In K-Modes, these are derived from the mode. For datasets with mix of categorical and numerical data, the centroids are found by calculating the mean of the numeric data and taking the mode of the categorial data. A prototype is simply a combination of the mean and the mode.

     CityEye ColorAgeScore
     MiamiBlue3293
     MiamiGreen2532
     ChicagoBlack2772
     MiamiBlue2163
    PrototypeMiamiBlue26.2565

    The function that is used to check how similar or different each data point is to a prototype is also a combination of the Euclidean distance from K-Means and the matching function from K-Modes:

    Let’s take this function apart and examine what each piece means:

    The red part of the function above is the k-means Euclidean measurement, and the green part is the k-modes matching dissimilarity measurement.

    The symbols p and q in the function above represent two individual data points.

    N and C represent sets of numeric and categorial variables and n and c represent a variable from both sets. The symbols pn, qn, pc, qc represent an numeric and categorial values.

    Something new you might have also noticed is the gamma term:

    This term controls the amount of weight categorial variables contribute to the output.

    A large enough value for gamma would result in more focus being placed categorial variables. As a result, a point could be assigned to a cluster where the cluster’s prototype is not nearest to it in terms of numeric variables, but the categorial values of most of the other points in the cluster matches that point.

    Here’s an example of the dissimilarity function in action using two data points:

     CityEye ColorAgeScore
    pMiamiBlack3293
    qChicagoBlack2772

    From the function above, N would represent a set containing the numeric variables {Age, Score}, and C is a set containing categorial variables {City, Eye Color}.

    Calculating the Euclidean (red) part of the function, we get the following:

    For the matching (green) part of the function, lets suppose that gamma is set to 1.5. Here’s what we’d get:

    Adding the results from both calculations together, the final result is 23.08.

    Here’s the step by step breakdown of the k-prototypes procedure, taking into account everything covered so far:

    1. Choose k and gamma. Determine the number of clusters to form and the amount of weight categorial variables should have on the clustering.
    2. Select initial prototypes. In this step the algorithm chooses k data points from the dataset at random to be the initial prototypes.
    3. Assign data points to a prototype.  In this step the algorithm calculates the dissimilarity between each data point in the dataset and each prototype and assigns the data point to the prototype it is closest to.
    4. Calculate new prototypes. A new prototype is calculated for each cluster using the dissimilarity function described earlier.
    5. Repeat steps 3 and 4. The algorithm reassigns data points to clusters based on how close they are to the new prototypes. After the reassignment new prototypes are computed. This process is repeated until no changes in the assignments are made.

    Determining Optimal Number of Clusters

    Like all the other methods covered before this one, the Elbow method can be used to determine optimal number of clusters. Here’s the new cost function you will be using for the elbow plot:

    As you might have guessed, this is simply a combination of the sum of squared error from k-means and the total dissimilarity score from k-modes.

    But wait, what about optimal values for gamma?

    With the addition of the gamma term, you might be wondering if there’s a method for finding good values for that parameter. Finding optimal values involves pure experimentation. You perform k-prototypes on your dataset several times with different values and choose a value that weighs the categorical and numeric attributes in the way you expect.

    If you’re looking for a quick and dirty value that provides a decent balance between the categorical and numeric attributes in your dataset, a good heuristic you can use is the standard deviation of the numeric attributes in the dataset.  Z. Huang, in his early experiments with the k-prototypes algorithm, found a suitable value to lie between 1/3 and 2/3 of the standard deviation of the numeric data. Here’s a link to his paper where he describes the experiments more in detail.

    That’s all, folks!

    I hope you enjoyed this tutorial. One thing that was new with this segmentation method was the usage of a gamma paramter to control the influence variables have on the output. In my next post I will cover another clustering algorithm that also makes use of weights. Stay tuned.

  • Mediods Data Clustering with CLARANS

    Mediods Data Clustering with CLARANS

    K-Medoids is a data clustering algorithm that is commonly used on datasets containing extreme values. In previous posts on the subject, I covered two implementations of the algorithm that can be used to find medoids in a dataset.

    The first method, called PAM, scans every data point in the dataset for k medoids. PAM can accurately locate the medoids in the dataset, but at expense of efficiency on large datasets.

    CLARA was the second method I covered. It’s able to achieve superior performance on large datasets by restricting the scan to a smaller randomly sampled portion of the dataset. An issue that may arise with CLARA, however, is that the true medoids of the dataset might get excluded from the sample, causing the method to return sub-optimal results.

    This is where the next algorithm I will discuss comes into play. This algorithm, like CLARA, does not check every data point in the dataset. But unlike CLARA, its scan is not restricted to a small sample of data points, but at random neighborhoods of points selected at multiple iterations. This method is called CLARANS, which is short for  Clustering LARge Applications based on RANdomized Search.

    The Search For the Best Medoids Data Clustering

    The CLARANS clustering algorithm starts out just like any other k clustering method. K data points are chosen at random to be the medoids that all the other data points in the data set will cluster around. From there, CLARANS searches for neighboring medoids that cluster the data better. A better clustering is defined as a set of medoids where the total dissimilarity between each medoid and the data points assigned to it is as small as possible.

    For the sake of making this process CLARANS takes to search for better medoids easier to understand, I want you to imagine that you are standing in a large room littered with dozens of water filled holes.  You and your friends are playing a game where you are tasked with finding the deepest hole in the room. The catch is that this has to be done blindfolded.

    Each hole in the room represents a set of k data points from the dataset chosen at random.

    After your friends place you at one of the holes, you begin your task for looking for the deepest hole in the room. Because you’re unable to determine this visually, you do this by sticking your bare foot into the holes and measuring the depth by how high the water rises up your leg. The depth a hole represents the total cluster dissimilarity for that set of k medoids.

    You make a mental note of  the depth of the hole that’s in front of you and then randomly choose one of the nearby holes to measure next.

    In the case of the CLARANS clustering algorithm, two sets of k medoids are neighbors if they share all but one medoid.

    Neighboring set of k-medoids, where k = 7

    CLARANS selects a neighbor by randomly selecting one of the k medoids from the dataset and replacing it with a data point from the dataset. From there it computes the total cluster dissimilarity of the neighbor and compares it with the dissimilarity of the current set of medoids. 

    CLARANS searches several neighbors, keeping track of the best set of mediods it finds thus far. The maximum number of neighbors that is searched is actually predetermined at the beginning of the algorithm. We’ll call this parameter maxneighbors. Each time CLARANS finds a better set of medoids, it searches maxneighbors more neighbors to see if it finds an even better set. If it fails to find a better clustering after examining additional neighbors, the search is concluded. The set of medoids CLARANS settles on is known as a local minima.

    The following flow chart below summarizes the local search described above:

    CLARANS data clustering local search flow diagram

    From Local Search to Global Search

    Going back to our toy scenario, after searching several neighboring holes, you were able to find the deepest hole within a the small region of holes that you searched. That hole you found, may not be the deepest hole in the room.

    To truly determine this, you decided to repeat your local neighborhood search at other random holes in the room, and compare the deepest hole you found with what you just found. With that your friends proceed to move you to other random holes in the room where you can conduct additional searches for local minimas.

    Translating this process to the CLARANS algorithm, the number of local searches that is performed is another parameter that is determined at the start of the algorithm. We’ll call this number numlocal. After performing several of these local searches, CLARANS has a pretty good approximation of the true medoids in the dataset.  

    The entire CLARANS data clustering process can be summarized with the following steps:

    1. Specify parameters numlocal, k, and maxneighbors.
    2. Let B represent the best k mediods discovered so far.
    3. Let f represent the number of local searches to perform. Set this number to 1.
    4. Let H represent  k randomly selected medoids from the data set.
    5. Let c represent the number of neighbors inspected. Set this number to 1.
    6. Let s represent a random medoid from H. Let j represent a data point from the dataset that is not in H. Let N equal a set of k medoids that share the same points as H, except s is replaced by j.
    7. If the total cluster dissimilarity in N is less than that of H, set H to N and go to step 5.
    8. Increment c by 1. If c < maxneighbors, go back to step 6.
    9. If the total dissimilarity of H is less than that of B, set B to H.
    10. Increment f by 1. If f  < numlocal, go back to step 4.
    11. Output B.

    That’s all, folks!

    In the next post, we will be covering a method that can be used to cluster datasets with a mixture of numeric and categorical data. Stay tuned.

  • Faster Mediods Data Clustering with CLARA

    Faster Mediods Data Clustering with CLARA

    Previously I introduced K-Medoids clustering, a data clustering method that can be used to cluster datasets containing outliers. The Partition Around Medoids (PAM) algorithm I described in my post is computationally slow and expensive and is typically used on small datasets because of that. What if we have a very large dataset that we’d like to cluster using K-Medoids. Do we have no other choice but to sit around a wait for the time consuming clustering method to finish? Not at all! In this post I will introduce you to CLARA, a K-Medoids implementation that can be on large datasets. Let’s get started!

    CLARA Data Clustering .. PAM, But with A Twist

    CLARA is short for Clustering LARge Applications. The PAM Algorithm computes medoids  by finding the data point with the shortest distance from all the other points in its cluster. Computing medoids using PAM is very quick for small datasets. But when it comes to much larger datasets, this could take quite a bit of time. 

    CLARA addresses this issue by performing PAM on a randomly selected sample of data points. We will call the size of this smaller dataset s. The idea behind this approach computing the medoids of this sample would give us an approximation of the medoids if the entire dataset was clustered, assuming that data points from the sample is representative of the entire dataset .

    This clustering of sampled data points is performed not once, but several times. In order to ensure that the best possible clustering is achieved, CLARA draws several random samples of data points, performs PAM on each, and then determines the clustering that produces the best results. We will call the number of random samples and clustering performed n.

    How does CLARA determine which clustering result is the best, you ask?

    That all depends on the dissimilarity (i.e. distance) method that was used to form the clusters. If Euclidean distance was used, than best is defined as the result with the smallest sum of squared error. If Manhattan distance was used, then the total absolute difference is used. See my articles on K-Means, and K-Medians for more information about these measures.

    The medoids from the best clustering is then used to assign the data points from the entire dataset to the appropriate cluster.

    Dataset before CLARA

    Step 1: Sample data points

    Step 2: Compute sample medoids
    Step 3: Assign data points to medoids

    With that, the entire process is outline in the following steps

    1. Select n samples of the dataset, each with the size s.
    2. Perform PAM on each sample
    3. Determine the PAM clustering that produces the most optimal results.
    4. Use the selected PAM clustering to assign each datapoint from the entire dataset to the appropriate centroid.

    With the CLARA implementation, we have an approach for efficiently computing approximate medoids for our dataset. There are some disadvantages to this approach:

    The quality of the results returned by CLARA is heavily dependent on the data sampled. If it turns out that there was some bias in the selection of the data  points, then the clustering results will not be good.

    The effectiveness of CLARA also depends on the sample size. CLARA can’t produce a good clustering if any of the medoids sampled is far from the actual medoids. This will most definitely the case if the sample sizes are too small. On the other hand, sample sizes that are too large would result in longer processing times. Therefore, when using CLARA one must choose a sample size that provides an acceptable tradeoff between speed and accuracy.

    Coming up next

    My next post will cover the CLARANS data clustering algorithm, another K-Medoids implementation. Until next time.