Category: General

  • Using K-Medians for Outlier Clustering.

    Using K-Medians for Outlier Clustering.

    Hey, readers! I’m back again with another blog post. If you’ve been following the series of posts I’ve been releasing thus far you will already be well familiar with the k-means clustering algorithm. In my last blog post I covered k-modes, a variation of k-means that can be used for clustering categorial data. In this blog post I will introduce you to another variation of k-means called k-medians.

    The case for K-Medians

    In the K-Means post, I mentioned that one of the limitations of the algorithm is its sensitivity to outliers. The mere existence of just one extreme value in the dataset can have a huge influence on the placement of the cluster centroids and could cause outliers to get their own clusters instead of being ignored.

    One way to avoid this issue is to remove the outliers from the dataset prior to clustering. But what if that wasn’t an option for you? What if there’s just too many outliers in your dataset for you to simply remove? That’s where K-Medians comes in.

    What makes K-Medians different from K-Means is two things:

    First off, K-Medians computes the dissimilarity between data points using Manhattan distance instead of Euclidean distance. More on Manhattan distance later in this post.

    Secondly, as the name suggests, K-Medians computes new cluster centroids using the median. The median is a robust measure of a dataset’s center and this therefore less sensitive to the existence of outliers in the dataset.

    To demonstrate this fact, let’s suppose we have a small dataset of values:

    1, 6, 9, 7, 12

    The mean of the dataset is 7.If another data point with the value 7000 was added to the data the resulting mean would be 1172.5, which is drastically different from the original dataset.

    If we computed the median of the dataset with and without the new data the result would be 7 and 8 respectively. The median of the dataset with the additional datapoint is only slightly higher than the median of the original dataset.

    By using the median instead of mean for calculating new centers for each cluster, K-medians does not suffer from the outlier sensitivity that K-means has.

    About the Manhattan Distance

    Another name for the Manhattan distance is the city block distance.

    Imagine each cell is a building and the grid lines are side walks. If one wanted to travel from point A to point B in the picture, he/she could not take a straight path. Both paths available in the image would require a turn to be taken. The Manhattan distance can be used to calculate the distance walked.

    The equation for the Manhattan distance is the following:

    Allow me to break this formula down and explain how the distance is calculated.

    Given two data points, x and y the formula adds up the absolute difference between each component from x and y.  As an example, computing the Manhattan distance between the data points (3,5,9,1) and (7,2,0,8) yields the following:

    |3 – 7| + |5 – 2| + |9 – 0| + |1 – 8| = 23

    The Algorithm

    Now that you’re aware of the Manhattan distance and the median, here’s the breakdown of the K-Medians clustering algorithm

    1. Select k data points at random to be initial center points
    2. Use the Manhattan distance to calculate dissimilarities between each datapoint and the center points.
    3. Assign each datapoint to the cluster with the smallest Manhattan distance.
    4. Use the median to compute new center points for each cluster
    5. Repeat steps 2, 3, and 4 until there’s no change in the cluster assignments.

    Choosing K

    As with K-means, the Sihouette method can be used to find the optimal value for k. A modified version of the elbow method can also be used. Instead of using the sum of squared error to calculate the total distance between each data point and it’s centroid, the absolute difference is used:

    For each cluster the absolute error between the data point pi and the centroid µj is computed. These differences are then summed up to get the total error for the clustering.

    From there the process of producing the elbow plot is the same as previously described in my earlier posts.

    What’s Next

    That’s it folks. In my next post, I will cover yet another variation of k-means that is also robust to outliers. Until next time.

  • Using K-Modes to Cluster Categorical Data

    Using K-Modes to Cluster Categorical Data

    If you’ve been following the series of posts I’ve written thus far, you will have already become familiar with the k-means clustering algorithm and how to determine the optimal number of clusters to produce using the algorithm. This clustering method, unfortunately, is only useful for numerical datasets. What happens if we want to cluster non-numerical data, such as names, locations, colors, and companies? The clustering algorithm I will cover is a variation of k-means that can be used on categorial data. This method is called K-Modes.

    So, what is the K-Modes Algorithm?

    The K-Modes clustering procedure is very similar to k-means, actually.

    Lets say we have a small dataset of individual categorical data that we’d like to cluster into 3 groups:

     Education LevelEthnicityOccupationMarital StatusState
    P1AssociatesHispanicAnalystSingleFlorida
    P2BachelorsCaucasianEngineerDivorcedCalifornia
    P3MastersNative AmericanEngineerSingleNorth Carolina
    P4BachelorsHispanicEngineerDivorcedCalifornia
    P5AssociatesAfrican AmericanAccountantMarriedTexas
    P6AssociatesCaucasianEngineerMarriedTexas
    P7BachelorsCaucasianAccountantWidowedFlorida
    P8AssociatesAsianTeacherMarriedTexas
    P9BachelorsAfrican AmericanAccountantSingleFlorida
    P10MastersCaucasianLawyerSingleTexas

    The process to cluster the dataset can be broken down to 4 steps.

    Step 1: Select k data points at random to be initial center points

    Just like with K-means, the method starts out with deciding the number of clusters desired and then choosing at random several points to be the initial center points for the algorithm.

    From our little example dataset, we will choose P2, P6, and P9 as center points (centroids):

     Education LevelEthnicityOccupationMarital StatusState
    P2BachelorsCaucasianEngineerDivorcedCalifornia
    P6AssociatesCaucasianEngineerMarriedTexas
    P9BachelorsAfrican AmericanAccountantSingleFlorida

    Step 2: Calculate dissimilarities between each datapoint and the center points

    K-means used Euclidean distance to determine how similar (or dissimilar) a datapoint was to a centroid. K-modes uses a different method for determining similarity.

    K-modes compares each datapoint to the centroids and count up the number of dissimilar attributes.

    When we take P1 for example, and compare it to each centroid, here’s what we get:

    P1 => P2

     Education LevelEthnicityOccupationMarital StatusState
    P1AssociatesHispanicAnalystSingleFlorida
    P2BachelorsCaucasianEngineerDivorcedCalifornia

    Number of Dissimilarities: 5

    P1 => P6

     Education LevelEthnicityOccupationMarital StatusState
    P1AssociatesHispanicAnalystSingleFlorida
    P6AssociatesCaucasianEngineerMarriedTexas

    Number of Dissimilarities: 4

    P1 => P9

     Education LevelEthnicityOccupationMarital StatusState
    P1AssociatesHispanicAnalystSingleFlorida
    P9BachelorsAfrican AmericanAccountantSingleFlorida

    Number of Dissimilarities: 3

    Step 3: Assign data points to a cluster

    As with K-means, each datapoint is assigned to the centroid with the smallest dissimilarity score.

    Below is a table containing the dissimilarities between each data point and the centroids and its assigned cluster.

     Cluster 1 (P2)Cluster 2 (P6)Cluster 3 (P9)Assigned Cluster
    P1543Cluster 3
    P2035Cluster 1
    P3454Cluster 1
    P4144Cluster 1
    P5523Cluster 2
    P6405Cluster 2
    P7341Cluster 3
    P8525Cluster 2
    P9350Cluster 3
    P10434Cluster 2

    Step 4: Calculate new cluster centroids

    This step is where k-modes gets its name from. We compute the new centroids for each cluster by using the mode. If there are values that have the same number of occurrence, one of the values can be chosen at random.

    The new centroids for each cluster are:

     Cluster 1

     Education LevelEthnicityOccupationMarital StatusState
    P2BachelorsCaucasianEngineerDivorcedCalifornia
    P3MastersNative AmericanEngineerSingleNorth Carolina
    P4BachelorsHispanicEngineerDivorcedCalifornia
     New CentroidBachelorsCaucasianEngineerDivorcedCalifornia

    Cluster 2

     Education LevelEthnicityOccupationMarital StatusState
    P5AssociatesAfrican AmericanAccountantMarriedTexas
    P6AssociatesCaucasianEngineerMarriedTexas
    P8AssociatesAsianTeacherMarriedTexas
    P10MastersCaucasianLawyerSingleTexas
    New CentroidAssociatesCaucasianAccountantMarriedTexas

    Cluster 3

     Education LevelEthnicityOccupationMarital StatusState
    P1AssociatesHispanicAnalystSingleFlorida
    P7BachelorsCaucasianAccountantWidowedFlorida
    P9BachelorsAfrican AmericanAccountantSingleFlorida
    New CentroidBachelorsCaucasianAccountantSingleFlorida

    Step 5: Repeat steps 2, 3 and 4

    As with k-means, the algorithm  reassigns data points to clusters based on their similarity to the centroids. New centroids are then computed, and the process repeats until no changes in the cluster assignments are made.

    Determining the optimal ‘k’

    Choosing the optimal number of clusters for k-modes can be done using a slightly modified version of the elbow method employed for k-means. Here’s the steps to follow:

    1. Perform K modes clustering for a range of values of k.
    2. After each clustering, compute the  dissimilarity score between each datapoint and it’s assigned cluster centroid, and add them up to get the total dissimilarity score.
    3. Create a line plot of the total dissimilarity score for each value of k.
    4. Finally look at the plot for the value of k where the dissimilarity score begins to decrease linearly. This value is your optimal number of clusters for the dataset.

    Now that you know the k-modes clustering algorithm, what datasets do you plan of using it to cluster? Share your plans in the comments below.


    Until next time! 🙂

  • K-Means Parameter Selection

    K-Means Parameter Selection

    In my last post, I provided a 101 on the K-means clustering algorithm. In case you haven’t had a chance to read the post this link will take you to it. One question you might be wondering after reading the post is how to determine out the optimal number of clusters for k-means. That’s exactly what this post is all about. At the end of this post you will have two techniques in your back-pocket that you can use to determine the optimal value for k in K-means.

    Elbow Method

    The first technique is the simplest of the two and is called the Elbow method. The Elbow method involves the following steps:

    1. Perform K means clustering for a range of values for k.
    2. After each clustering, compute the sum of squared error ( or SSE for short) of each cluster. — More on that in a bit
    3. Create a line plot of the SSE for each value of k. This plot is called the elbow plot
    4. Look at the elbow plot and find the value of k where the SSE begins to decrease linearly. That value is the optimal number of clusters in your dataset.

    So, what is the Sum of Squared Error?

    The cluster sum of squared error is the average distance each data point is from the centroid. Calculating this involves the following steps:

    1. For each cluster and centroid produced by k-means
      1. Calculate the difference between each data point and its cluster centroid. Square the difference. We’ll call the result e
      2. Compute the sum of all the differences. Let’s call the result t
    2. Next we will add up the total distances computed for each cluster. The result of the calculation is the sum of squared error.

    It is best summarized with this formula:

    The Intuition of the Elbow Plot

    When you plot the SSE for each value of k, you will immediately see why this this plot is called the Elbow plot. It’s because the plot looks like an elbow!

    As the number of clusters increases, the sum of squared error decreases. Our optimal number of clusters for k means will be a number where the SSE is small. If this number is too small though, we will end up with too many clusters. If this number is too big, then we will end up with too few clusters to make sense of the data we have. The purpose of the elbow plot is to help us find the value of k where we start to see diminishing returns. This point is our optimal value for k. It just so happens that this point on the plot is the elbow part of the arm.

    As you can see from the plot above, this elbow appears to occur at k = 3, making this the optimal number of clusters.

    But what happens when you have a plot with a fairly smooth or linear curve and no obvious elbow point, such as the one below.

    This is where the next method I will show you comes in handy.

    Introducing the Sihouette Method

    The Sihouette method involves computing coefficients for each point in the dataset. These coefficients measure how similar a point is to its assigned cluster compared to other clusters. The value of each coefficient range between -1  and 1. Coefficients on the high end of the range indicate that the data point is well matched to its assigned cluster and poorly matched to other clusters. Coefficients on the low end of the range indicate that the data point is poorly matched to its cluster compared to other clusters.

    Let’s go through how to compute a sihouette coefficient for a simple data point:

    1. First we calculate the average distance the data point is from all other points in the cluster. We’ll call this number a.
    2. Next we calculate the average distance of that data point from points in the cluster closest to its assigned cluster. Let’s call this number b.
    3. Finally we take the two measurements we computed and calculate the coefficient by subtracting a from b and dividing by the larger of the two numbers.

    This process is best summarized with this formula:

    If most of the data points have  high coefficient values, then the number of clusters used is optimal. If, on the other hand there are many data point that have low coefficient values, this indicates that either two many or too few clusters.

    A sihouette plot can be used to visually inspect how close each point in one cluster is to points in neighboring clusters.

    The x axis in the plot displays the sihouette coefficient values. The y axis displays the labels for each cluster. The red dashed line is the average sihouette coefficient, computed by averaging all the coefficients in the dataset.

    Here’s some rules of thumb to follow when analyzing a sihouette plot:

    • If all the cluster plots is beyond the average sihoutte score, have mostly uniform thickness, and do not have wide fluctuation in size, that’s a good indication that the number of clusters used is optimal.
    • The number of clusters is sub optimal if you observe either of the following:
    1. The plot for a cluster falls below the average coefficient. Such as the plot from this link: https://vitalflux.com/wp-content/uploads/2020/09/Screenshot-2020-09-16-at-11.59.52-AM.png
    2. There are wide fluctuations in the size and thickness of the cluster plots.

    You have just learned two methods for choosing the optimal number of clusters to form for k-means. If you enjoyed this post, subscribe to my blog to keep abreast of more posts like this.

  • K-Means.. The Simple Clustering Algorithm

    K-Means.. The Simple Clustering Algorithm

    If you do a quick Google search for clustering algorithms, k-means will undoubtedly be the first algorithm that will be mentioned in the search results, and for good reason too. It’s the first algorithm that most professionals turn to for segmenting datasets.  Compared to other clustering methods out there, k-means is both very simple to implement and understand. In this post I will explain how this algorithm works and explain some pros and cons that you should be aware of when using this algorithm. In future posts, I will delve into more technical aspects of k-means and also walk through a coded implementation of this algorithm.

    Here goes…

    So what exactly is k-means?

    The k-means algorithm divides a dataset into groups. The datapoints in each group are in close proximity of each other (at least as close to each other as possible). The k in k-means comes from the choice in the number of clusters to create. When using this algorithm, the practitioner provides two things:

    1. A dataset, and
    2. The number of clusters to create. We call this number k.

    The algorithm will then take the dataset and create the desired number of clusters. The means in k-means comes from the method in which the clusters are formed. The algorithm repeatedly estimates the center of each cluster by calculating the mean (average) of the data points in the cluster.  Another name for this center point is centroid. Each data point in the dataset is assigned a cluster based on how close it is to a centroid. This will make more sense once we go through the next section.

    Before K-Means
    After K-Means

    So how exactly does it work?

    Here’s a breakdown of the clustering algorithm in all of it’s parts.

    Choose initial centroids
    Assign points to cluster
    Calculate new centroids
    Repeat until done
    1. Choose k. The first step is to determine the number of clusters you’d like to create. We’ll call this number k
    2. Select initial cluster centroids. In this step the algorithm chooses k data points from the dataset at random. These data points serve as center points (centroids) for the clusters.
    3. Assign data points to a cluster.  In this step the algorithm calculates the distance between each data point in the dataset and each centroid and assigns the data point to the centroid it is closest to.
    4. Calculate new cluster centroids. The algorithm calculates the new center points for each cluster by computing the calculating the average of each data point in the cluster.
    5. Repeat steps 3 and 4. The algorithm reassigns data points to clusters based on how close they are to the new centroids. After the reassignment new cluster centroids are computed. This process is repeated until no changes in the assignments are made.

    But wait, how is distance between a data point and a centroid calculated?

    There are many methods that can be used to calculate distance (i.e. similarity) between points, many of which I plan on covering at a later time. But the method most commonly used is Euclidean distance. The formula is as follows:

    So, what are the advantages of k-means?

    Very Simple to Implement. This is probably one of the biggest advantages of this algorithm. Not only is it very easy to implement, it is also very easy to interpret and visualize compared to other algorithms out there.

    Very efficient. The k-means clustering algorithm is highly scalable and can efficiently cluster large datasets.

    Highly adaptable to new data points. K-means clustering algorithm is extremely flexible and can adjust the clustering of the data as new data points are added to the dataset. This is really great for real-time applications where new data points can be inserted to the dataset on the fly.

    And, what are the disadvantages of k-means?

    K must be chosen manually. As you may have already surmised from the explanation of the algorithm, K-means is not capable of determining the optimal number of clusters in the dataset automatically. You must specify the number of clusters in advance. In a future post I will cover some strategies you can use to determine the optimal number of clusters for K-means

    Very sensitive to noisy data and outliers. The presence of extreme data points in the dataset can have a huge impact on the quality of the clusters that are formed. The placement of cluster centroids, for instance can be heavily influenced by outliers. Or better yet, outliers might even get their own cluster instead of being ignored. To prevent any of these from occurring, make sure to remove these outliers from the dataset before clustering.

    Dataset with outliers before clustering
    Dataset with outliers after clustering

    Results are inconsistent. Due to the fact that the initial center points are chosen at random, you can run k-means multiple times on the same dataset and get different answers. What people therefore do in practice is run k-means several times with different initial values and pick the one with the best result.

    Works best with spherical datasets. One big assumption K-means has on the dataset is that it contains spherically shaped clusters. This is not always the case however. When given datasets such as the one below, K-means may produce a result that might not be what you’d expect.

    Non spherical dataset before clustering
    Non spherical dataset after clustering

    Assumes clusters have similar density. Another big assumption that K-means makes on the dataset is that the clusters are even in size and density. When given dataset with uneven distributions of points per cluster, the clustering starts to break down.

    What’s next?

    You now have a high level understanding of k-means clustering and how it works. Other the next few posts, I will start diving into more technical details on the algorithm, including strategies for choosing the optimal number of clusters, and other distance methods you can use with k-means.

  • The Obligatory Hello World Post

    The Obligatory Hello World Post

    Welcome to SegmentationPro!

    This a blog that explores methods for clustering and analyzing data of all shapes, sizes, and types with a focus towards customer segmentation. Customer segmentation is a powerful tool that businesses employ in order to effectively market their products and services to new customers. It can also help businesses improve the quality of service provided to existing customers and promote customer loyalty and retention.

    Most of the data science and machine learning blog posts I found on the topic usually refer to K-Means clustering or a similar unsupervised learning. There are many more algorithms that can be employed for this purpose and I wanted to create a resource for anyone looking for other methods to employ in their projects.

    Whether you are a budding data analyst, an experienced professional, or a marketing professional looking to expand their skillset, my intention is to make SegmentationPro assessable to all interested parties. Subscribe to my blog to keep apprised of new posts and updates.