Tag: clustering

  • Improving the Video Game Recommender

    Improving the Video Game Recommender

    In my last post, we took the 380+ game video game attributes that we extracted from the Steam video game dataset and wrote an algorithm to cluster the attributes into 24 groups. In this post we will use the clusters to make an new and improved video game recommender. If you haven’t read my first
    and second post on the video game recommender, please read them before continuing.

    Step 1: Reconstructing the game features table

    The game features table that we built in the first post contained hundreds of attributes. We will construct a much smaller table using the game attribute clusters.

    Game Features Table

    The each column in the table indicates the magnitude of that attribute present in each game. The numbers are computed for each game using the attribute-cluster assignments we obtained from K-means. For each game, we count up the number of attributes that belong to each cluster. The construction of the new game features table is described in the code snippet below.

    # Fit the K Means clustering algorithm get the cluster assignments for each attribute
    km = KMeans(n_clusters=24, random_state=25)
    km.fit(feature_set)
    labels = km.predict(feature_set)
    attribute_assignments = pd.Series(labels, index=game_categories)
    Group attributes into list of clusters
    attribute_clusters = []
    for i in range(24):
    cluster = attribute_assignments[attribute_assignments == i]
    attribute_clusters.append(cluster.index.tolist())
    feature_columns = ['clust_'+str(i) for i in range(25)]
    game_features = []
    for idx in range(steam_games_df.shape[0]):
    # Obtain list of genres, tags, and specs
    game_genre = steam_games_df.iloc[idx]['genres']
    game_tags = steam_games_df.iloc[idx]['tags']
    game_specs = steam_games_df.iloc[idx]['specs']
    attributes = []
    data_row = {k:0 for k in feature_columns}
    data_row['id'] = steam_games_df.iloc[idx]['id']
    # Iterate through each entry in the lists and create the features
    if game_genre:
    attributes.extend(game_genre.split(','))
    if game_tags:
    attributes.extend(game_tags.split(','))
    if game_specs:
    attributes.extend(game_specs.split(','))
    attributes = set(attributes)
    if len(attributes) > 0:
    for attr in attributes:
    for i in range(len(attribute_clusters)):
    if attr in attribute_clusters[i]:
    data_row['clust_'+str(i)] += 1
    else:
    data_row['clust_24'] += 1
    game_features.append(data_row)
    game_features_df = pd.DataFrame(game_features)
    game_features_df = game_features_df.set_index('id')

    Step 2: Reconstructing the user features table

    Using the game features table we built in step 1, we will rebuild a much smaller user features table.

    User Features Table

    Similar to the game features table, each column in the user features table indicates the degree each user prefers a game with that attribute. The numbers are computed for each user by retrieving the features for each game played by the user from the game features table and adding them up.

    game_feat_dict = game_features_df.to_dict()
    # Read user items data file and build features table
    with open('/content/drive/MyDrive/VideoGameRecFiles/australian_users_items.json','r',encoding='utf8') as f:
    data = f.read()
    data = data.strip().split("\n")
    user_features = []
    # We need to keep track of all the games each user played so we can avoid recommending games that they have already played.
    user_play_list = {}
    for user_data in data:
    # The stdataset is not a properly formatted json file. Because of this we need to iterate through each individual JSON object and use
    # the ast module to parse the object.
    record = ast.literal_eval(user_data)
    data_row = {k:0 for k in feature_columns}
    data_row['user_id'] = record['user_id']
    play_list = []
    for item in record['items']:
    item_id = item['item_id']
    play_list.append(item_id)
    for col in feature_columns:
    if item_id in game_feat_dict[col]:
    data_row[col] += game_feat_dict[col][item_id]
    user_play_list[record['user_id']] = play_list
    user_features.append(data_row)
    user_features_df = pd.DataFrame(user_features)
    user_features_df = user_features_df.set_index("user_id").drop_duplicates()

    Step 3: Defining a new recommender function

    With our new game and user feature tables in place a new method for examining similarity between users is in order. In our first recommender, we used the matching dissimilarity score. In our new recommender, we’re going to use cosine similarity. Cosine similarity is a measure of similarity between two numerical vectors. It is the dot product between two vectors divided by the product of their lengths.

    Let’s suppose that we have a user that we want to generate recommendations for. We’ll call the user in question u and the number of recommendations we’d like to generate x. We will use cosine similarity to find another user named v who’s preference is the most similar to u. We will then select x games from v‘s play history that has not been played by u and then recommend them

    Here’s the new recommendation procedure in code form.

    def cosine_score(user1, user2):
    score = cosine_similarity(user1.values.reshape(1, -1), user2.values.reshape(1,-1))[0][0]
    return score
    def recommend_games(user_id, n=10):
    '''
    Given a user id, recommend games to that user. By default 10 games are recommended
    '''
    # Get user features
    user = user_features_df.loc[user_id]
    # Get games played by the user
    play_list = user_play_list[user_id]
    other_users = user_features_df[user_features_df.index != user_id]
    scores = other_users.apply(lambda user2: cosine_score(user, user2), axis=1).sort_values(ascending=False)
    rec_idx = 0
    recommended_games = user_play_list[scores.index[rec_idx]]
    recommended_games = list(filter((lambda gid: gid not in play_list), recommended_games))
    while (len(recommended_games) < n):
    rec_idx += 1
    additional_games = user_play_list[scores.index[rec_idx]]
    recommended_games.extend(list(filter((lambda gid: gid not in play_list), additional_games)))
    return recommended_games[:n]

    That’s all folks!

    You can find the code for this post here. Until next time!

  • Clustering Video Game Attributes

    Clustering Video Game Attributes

    In a previous blog post, I walked through the creation of a simple recommender system that recommends video games to existing users on Steam. Since creating the recommender, my student and I have been exploring ways to improve it. One enhancement we’ve been looking at is speeding up the computational time of the recommender by clustering video game attributes (tags, genres, and specs) into smaller, more manageable groups. In this post I will describe how we utilized the K-means algorithm to do this.

    Improving the Recommender by Computing Probabilities

    The current system uses 188 categorical attributes to recommend video games to existing users. The biggest disadvantage of this approach is the large amount of computational time required to build the tables the recommender requires. The numerous categorical features also makes adding numerical features (such as game price) to the system challenging; the influence of the categorical features will outweigh any impact the numerical features could have on the recommendations. To correct both issues, we will attempt to reduce the number of attributes the recommender using K-Means clustering.

    The raw, unprocessed video game metadata contains 380+ attributes. We observed from looking at the data for several video games that some attributes tend to appear together with other attributes. This gave me the idea to try to group these attributes based on the likihood the attributes will appear together. We will do this by constructing a square matrix that contains the conditional probability of observing any two attributes in a video game. We use this matrix to cluster the game attributes.

    The code snippet below shows the construction of the probability matrix. We iterate through the each game in the steam games dataset and construct a dictionary containing the sets of games that have each attribute.

    # Create a dictionary containing sets of games that have each attribute
    category_sets = {}
    for idx in range(steam_games_df.shape[0]):
    game_genre = steam_games_df.iloc[idx]['genres']
    game_tags = steam_games_df.iloc[idx]['tags']
    game_specs = steam_games_df.iloc[idx]['specs']
    game_id = steam_games_df.iloc[idx]['id']
    if game_genre:
    cat_genres = game_genre.split(",")
    for g in cat_genres:
    if g in category_sets:
    category_sets[g].add(game_id)
    else:
    category_sets[g] = set([game_id])
    if game_tags:
    cat_tags = game_tags.split(",")
    for t in cat_tags:
    if t in category_sets:
    category_sets[t].add(game_id)
    else:
    category_sets[t] = set([game_id])
    if game_specs:
    cat_specs = game_specs.split(",")
    for s in cat_specs:
    if s in category_sets:
    category_sets[s].add(game_id)
    else:
    category_sets[s] = set([game_id])

    We use the attributes dictionary to build the probability matrix.

    game_categories = list(category_sets.keys())
    probability_matrix = []
    for g in game_categories:
    prob_list = []
    for c in game_categories:
    game_intersection = category_sets[g].intersection(category_sets[c])
    prob_list.append(len(game_intersection) / len(category_sets[g]))
    probability_matrix.append(prob_list)
    probability_matrix = np.array(probability_matrix)

    Applying Dimensionality Reduction

    The probability matrix computed in the last section has 381 dimensions. As discussed in my high dimensional clustering post, clustering data with very high dimensions could be problematic. To avoid these problems, we’re gonna apply dimensionality reduction.

    The code snippet below uses the pca module provided by Sci-kit learn to perform Principal Component Analysis on the probability matrix. To determine the number of principal components to keep, we computed a cummulative sum of the explained variance ratios for each principal component. Check out my article on PCA for more details on how it works.

    pca = PCA()
    pca.fit(probability_matrix)
    
    total = 0
    for idx, r in enumerate(pca.explained_variance_ratio_):
      total += r
      print("{0} Components: {1}".format(idx, total))

    We decided to keep just enough components to explain 70% of the variability in the data. That number happened to be 31. We will refit PCA to the dataset and reduce the dimensions.

    # Refit PCA to the probability matrix and keep only the 31 principal components
    pca = PCA(n_components=31)
    pca.fit(probability_matrix)
    feature_set = pca.transform(probability_matrix)

    With the newly featurized dataset in place, we can now proceed with the data clustering.

    Discovering Game Attribute Categories

    We use the sihoulette method to find the optimial number of clusters for k-means. The code snippet below uses the Sci-kit learn implementation of k-means and silhouettee score to derive scores for different numbers of clusters. Check out this post for details on how the sihouette method works.

    range_n_clusters = range(2, 28)
    
    for n_clusters in range_n_clusters:
        clusterer = KMeans(n_clusters=n_clusters, random_state=25)
        cluster_labels = clusterer.fit_predict(feature_set)
    
        silhouette_avg = silhouette_score(feature_set, cluster_labels)
        print(
            "For n_clusters =",
            n_clusters,
            "The average silhouette_score is :",
            silhouette_avg,
        )

    Using the snippet above we determined the optimal number of clusters to be 24. Last, but not least, we group the game attributes into 24 clusters and output the group assignments.

    # Fit the K Means clustering algorithm get the cluster assignments for each attribute
    km = KMeans(n_clusters=24, random_state=25)
    km.fit(feature_set)
    labels = km.predict(feature_set)
    attribute_assignments = pd.Series(labels, index=game_categories)
    for i in range(24):
    cluster = attribute_assignments[attribute_assignments == i]
    print("Cluster {0}: {1}".format(i, ",".join(cluster.index)))
    Cluster 0: Moddable,Trading,City Builder,Building,Economy,Base Building,Sandbox,Management,Space,Political,Agriculture,Space Sim,Capitalism,Politics,Resource Management,God Game,Fishing,Mining
    Cluster 1: Action,Indie,Simulation,Strategy,Single-player,RPG,Multi-player,Online Multi-Player,Cross-Platform Multiplayer,Steam Achievements,Steam Trading Cards,Stats,Adventure,Full controller support,Downloadable Content,Steam Cloud,Steam Leaderboards,Partial Controller Support,Early Access,Shared/Split Screen,Valve Anti-Cheat enabled,Steam Turn Notifications,Co-op,Violent,Commentary available,Steam Workshop,Includes level editor,Western,Flight,Tower Defense,Game demo,On-Rails Shooter,Soundtrack,Pinball
    Cluster 2: 2D,Replay Value,Difficult,Pixel Graphics,Cute,Singleplayer,Great Soundtrack,Retro,Platformer,Side Scroller,Stylized,Arcade,Underground,Remake,Action-Adventure,Spectacle fighter,Character Action Game,Beat 'em up,Controller,Fast-Paced,2.5D,Ninja,Puzzle-Platformer,Time Attack,Colorful,3D Platformer,Psychedelic,Score Attack,1980s,Time Manipulation,Cartoon,Metroidvania,Blood,Runner,Cartoony,GameMaker
    Cluster 3: FPS,Shooter,Third-Person Shooter,Sniper,Third Person,Survival,Classic,Gore,Sci-fi,Aliens,First-Person,Stealth,Assassin,Hunting,Futuristic,Cyberpunk,Destruction,Mechs,Robots,Lara Croft,Dinosaurs,Parkour,3D Vision,Zombies,Survival Horror,Bullet Time,Arena Shooter,Post-apocalyptic,Inventory Management,Star Wars,6DOF,Heist,Transhumanism,Gun Customization,Mars
    Cluster 4: Mod,Mods,Mods (require HL2),Mods (require HL1)
    Cluster 5: Design & Illustration,Tutorial,Education,Animation & Modeling,Animation &amp; Modeling,Video Production,Utilities,Web Publishing,Game Development,Software Training,Design &amp; Illustration,Audio Production,Photo Editing,Accounting
    Cluster 6: HTC Vive,Oculus Rift,Tracked Motion Controllers,Room-Scale,VR,Seated,Standing,SteamVR Collectibles,Keyboard / Mouse,Gamepad,Windows Mixed Reality,360 Video
    Cluster 7: Character Customization,Open World,Crafting,Swordplay,Hack and Slash,Action RPG,Medieval,Pirates,Dragons,Voxel,Sailing
    Cluster 8: Card Game,Trading Card Game,Turn-Based,Board Game,Turn-Based Strategy,4X,Turn-Based Tactics,Warhammer 40K,Games Workshop,Hex Grid,Tactical RPG,Turn-Based Combat,Strategy RPG,Asynchronous Multiplayer,Chess
    Cluster 9: Female Protagonist,Nudity,Anime,Choices Matter,Multiple Endings,Romance,Visual Novel,Sexual Content,Interactive Fiction,Dating Sim,RPGMaker,Choose Your Own Adventure,Text-Based,Otome
    Cluster 10: Tactical,War,Rome,Historical,Wargame,Cold War,Real-Time with Pause,RTS,Diplomacy,World War II,Alternate History,Real-Time,Grand Strategy,Real Time Tactics,Military,Naval,Tanks,America,World War I,Modern
    Cluster 11: 1990's,Story Rich,Atmospheric,Silent Protagonist,Linear,Mystery,Experience,Psychological Horror,Horror,Exploration,Point & Click,Underwater,Lovecraftian,Demons,Detective,Supernatural,Steampunk,Dystopian,Dark,Mature,Noir,Cinematic,FMV,Cult Classic,Based On A Novel,Surreal,Short,Walking Simulator,Psychological,Time Travel,Hand-drawn,Experimental,Quick-Time Events,Conspiracy,Narration,Dynamic Narration,Lore-Rich,Conversation,Nonlinear,Philisophical,Mystery Dungeon
    Cluster 12: Realistic,Driving,Trains,TrackIR
    Cluster 13: Captions available,Episodic,Crime,Benchmark,Movie,Thriller,Werewolves,Documentary,Martial Arts,Drama,Gaming,Foreign,Feature Film,Hardware,Faith
    Cluster 14: Fantasy,Dark Fantasy,Gothic,Isometric,Vampire,Magic,Mythology,Villain Protagonist,CRPG,Dungeon Crawler,JRPG,Kickstarter,Investigation,Crowdfunded,Grid-Based Movement,Voice Control
    Cluster 15: Casual,Physics,Science,Clicker,Puzzle,Music,Hidden Object,Match 3,Touch-Friendly,Family Friendly,Level Editor,Abstract,Relaxing,Mouse only,Music-Based Procedural Generation,Rhythm,Minimalist,Hacking,Lemmings,Sokoban,Typing,Programming,Artificial Intelligence,Word Game,Spelling,Steam Machine
    Cluster 16: LEGO,Batman,Superhero,Comic Book
    Cluster 17: Party-Based RPG,Software
    Cluster 18: Sports,Racing,Golf,Horses,Offroad,Bowling,Mini Golf,Football,Soccer,Gambling,Basketball,Cycling,Pool,Wrestling
    Cluster 19: Free to Play,PvP,Competitive,In-App Purchases,Multiplayer,Massively Multiplayer,MMO,Online Co-op,MMORPG,Online Co-Op,Team-Based,Includes Source SDK,Class-Based,PvE,MOBA,e-sports
    Cluster 20: Local Co-op,Local Multi-Player,Co-op Campaign,Local Co-Op,Local Multiplayer,Fighting,Split Screen,2D Fighter,4 Player Local
    Cluster 21: Top-Down,Top-Down Shooter,Loot,Shoot 'Em Up,Twin Stick Shooter,Bullet Hell,Rogue-like,Procedural Generation,Rogue-lite,Perma Death
    Cluster 22: Funny,Comedy,Satire,Dark Humor,Memes,Parody,Illuminati,Intentionally Awkward Controls,NSFW,Dark Comedy
    Cluster 23: Bikes

    K-means does a pretty good job with grouping attributes into meaningful clusters. One downside you might notice that the cluster assignments will not be consistent when the snippet is run subsequent times. Because the initial centroids used by K-means are choosen at random, we will get different cluster outputs on the same dataset. We can work around this issue by performing the clustering several times and choosing the results that have the lowest sum of squared error.

    That’s all folks!

    In my next post, we will use the attribute clusters to make a new version of the recommender. We will then explore methods for evaluating how good of a job our recommenders do with recommending video games to users. You can find the code for the entire solution here.

  • High Dimensional Clustering 101

    High Dimensional Clustering 101

    High dimensional data are datasets containing a large number of attributes, usually more than a dozen. There are a few things you should be aware of when clustering datasets such as these. The goal of this post is to highlight a few strategies you can use when performing high dimensional clustering.

    Setting the Stage

    To set the stage for the discussion take a look at the image below.

    Which one of the red points would you say is the closest to the x in the center? As you can see, there’s no clear answer. Some of you reading this may even say that is a nonsensical question to be asking, because these points are roughly equidistant from x. Even if there is a single point that is closer than the others, it would only be by a measly, negligible amount.

    When a data clustering algorithm looks at high dimensional data, the data points could look much like the above image. The results we get back from the algorithms may not capture the patterns present in the dataset. This is known in machine learning as the curse of dimensionality.

    Dum Dum Duuuuuuuuumm… The curse of dimensionality

    The curse of dimensionality is a very interesting, and counterintuitive phenomenon that arises when machine learning algorithms are applied to datasets with a large number of features.

    Data clustering algorithms work by computing distances between data points and grouping together points that are close together in proximity. When the number of features in a dataset is small, the algorithms are able to clearly the data points that are close together from the ones that are not. For datasets with many features, the task of identifying points that are closer together is more challenging. The reason for this is because the distance between points in high dimensions are far away. This is true even for points that are similar to each other.

    Pictured below are two charts. The chart on the left is a 1-dimensional plot of several data points. The chart on the right is a histogram of pairwise distances between each data point pictured in the first plot.

    As shown in the plot on the left, the dataset can be neatly grouped into three clusters. You can also see this from the three peaks in the histogram on the right.

    Here’s what the charts look like when we add a second dimension to the data.

    With the addition of the second dimensions, the distances between each point is larger. Nevertheless a clustering algorithm can still easily cluster these points into 3 groups.

    The same observation can be made if a third dimension was added to the data.

    As more dimensions are added to the dataset the distances between the points become larger and the peaks we observed in the histogram become narrower. After 100,000 dimensions we’ll get a histogram shaped like the one below:

    Euclidean distance of datapoints on a dataset with 100,000 dimensions.

    Wait, isn’t there supposed to be three peaks? At 100,000 dimensions, the distance between the data points became so large that the third cluster we saw in the first 3 plots became indistinguishable from the other two. If we were to run K-means on this dataset, the clusters it would give us would definitely not be what we’d expect.

    Now that you are aware of the curse of dimensionality I will briefly highlight in the next few sessions some methods that you can use to work around it.

    Use Feature Selection

    You don’t need to cluster with every attribute in your dataset. If you’re only interested in a handful of attributes, you can cluster on just those attributes and discard the others. You should also examine your dataset for features that show very little variation or are highly correlated with other features. Those features will not provide any meaningful information to a clustering algorithm and can therefore be discarded.

    Use Dimensionality Reduction

    Another approach for clustering high dimensional data entails projecting the dataset to a lower dimensional space before applying a clustering algorithm. Oftentimes a dataset can be approximated very well using a smaller number of dimensions without losing much information. This can be accomplished by applying techniques from linear algebra such as Principle Component Analysis and Singular Value Decomposition. I will explore both topics more in detail in future posts.

    Use a Graph-based Clustering Algorithm

    Graph based clustering algorithms such as Spectral clustering uses a different criteria for forming clusters that is more robust in high dimensions than K-means and other related methods. I will explore spectral clustering more in detail in a future post.

    That’s all folks!

    In my next post, I explore the topic of dimensionality reduction further by discussing the nuts and bolts of Principle Component Analysis. Stay tuned.

  • Let’s Get Fuzzy with C Means Clustering

    Let’s Get Fuzzy with C Means Clustering

    Not every dataset you will encounter in the wild will have data points that can be neatly placed in a single cluster. You may come across data points that could potentially belong to two, three, or possibly more clusters.

    Consider this scenario. You work as a data analyst for a company that has developed a fitness accountability app. During the first week of January, the company experienced a large number of new app installs. These installs are presumably from people looking to fulfill their new year resolution fitness goals. At the end of the month you gathered a week’s worth of activity data from several of these new users. Below is a scatter plot of this data.

    Data points in red represent users who deleted the app after 1 week of usage. The data points in green are users who are still actively using the app. From the plot it is clear that there’s a relationship between level of app engagement and likelihood to churn. Users furthest to the left are more likely to delete the app after a week, while users furthest to the right are more likely to remain active on the app. A high degree of uncertainty, however, exists for users in the middle of the plot. These users could either churn or remain active. 

    Let’s now suppose you were given the task to group these users into a “churn” and “not churn” categories. You could use a clustering method like K-means to group the users, but information about the degree of uncertainty in the categories would be lost. A new clustering method is needed that would allow you to model the degree a data point is similar to or dissimilar from users that churn and remain active.

    Enter Fuzzy C Means Clustering

    C Means is a k-means variation that assigns each data point probabilities of belonging to each cluster. C Means differs from k-means in two key ways:

    The first difference is that c means constructs a table (i.e. matrix) that contains the probability that each datapoint belongs to a particular cluster. Here’s what the table would look like for a few data points in the problem outlined above:

    UserChurnNot Churn
    A0.250.75
    B0.910.09
    C0.510.49
    D0.170.83

    Each entry in the table has a number between 0 and 1. The larger the number, the more likely that data point belongs to the cluster. The sum of probabilities for each datapoint must also equal 1. If you look at the probabilities for each data point in the data table above, you will see that they indeed do.

    This table, which we will call the fuzzy partition table, is randomly initialized at the start of c-means algorithm and then repeatedly updated, along with the centroid. Each entry in the table is updated using the following equation:

    dist(xi,cj) and dist(xi,cq) represents the Euclidean distance between a datapoint xi and a cluster centroid cj and between xi and cq. You may have also noticed the m exponential term in the numerator and denominator. This is a special parameter that controls the fuzziness of the results. When m is 0, the equation will produce a binary output (1 or 0), akin to what you’d get if K-means was applied. As a matter of fact, you can think about K-means as C-means with the m parameter set to 0.  The larger m is the fuzzier the output is.

    The second key difference lies in the calculation of the centroid on each iteration of the algorithm. The formula is as follows

    If you read my previous post on weighted kmeans, you’ll notice that this is essentially the weighted mean.

    With that, here’s the steps of the C means algorithm:

    1. Choose c, the number of clusters, and m
    2. Randomly select c data points from the dataset to be initial centroids
    3. Randomly choose values for each entry in the fuzzy partition table, subject to the constraint that the values sum to 1.
    4. Compute new centroids
    5. Update the fuzzy partition table
    6. Repeat step 4 and 5 until either there’s no change in the values in the table or the difference is smaller than some threshold.

    Choosing the C in C-Means

    For C means you can use the Elbow method to find the optimal value for c. Below is a modified version of SSE to use when choosing the optimal value for c.

    That’s all folks!

    In my next post I will discuss methods for clustering high dimensional data.

  • Scaling your data before clustering

    Scaling your data before clustering

    You will often need to perform some form of preprocessing on your dataset before running a data clustering algorithm. In this post I will introduce one important preprocessing method that is often overlooked when clustering data. That method, my friends, is called data scaling.

    What is data scaling?

    As you may already know, clustering algorithms work by computing distances (i.e. dissimilarities) between data points in the dataset and grouping together points that are close in proximity. The method used for calculating the distance will be different depending on the algorithm used. One trait most of these methods share is high sensitivity to attributes (also called features) with numerically larger values compared to other attributes in a dataset. These features will bias the distance calculations and can cause the clustering algorithms to produce subpar clusterings.

    To illustrate this, let’s say you have a dataset containing SAT and GPA scores for 5000 college students that you’d like to segment into 4 groups.

    StudentSATGPA
    114191.35
    213841.16
    37843.58
    47483.31
    511240.98
    49966832.02
    49977393.58
    49987413.21
    499910881.01
    50005951.60

    After running K-means on the dataset, you create a scatter plot showing the student scores and the clusters each student is assigned to.

    It’s immediately apparent looking at the plot that a large number of students were mislabeled by the algorithm. SAT scores have a significantly larger numerical scale (400 to 1600) than that of grade point averages (0 to 4). SAT scores therefore have a much larger influence on the  distance measurement K-means uses to group the students.

    As can be seen from the plot, K-means leaned heavily on the SAT scores to cluster the data. To ensure that both GPA and SAT has an equal weight in the clustering both features need to be transformed so they are on the same scale. This is exactly what data scaling is for.

    Here’s the scatter plot we get after scaling the data set and performing k-means:

    Student clusters after applying data scaling

    Data Scaling Techniques

    There are two main methods for scaling data. The first method is called normalization and the second method is called standardization.

    Normalization

    Normalization uses the minimum and maximum to transform features onto the same scale. Below is the formula used to normalize each feature:

    Here’s what the student scores dataset would look like when each feature is normalized:

    StudentSATGPA
    10.8491670.3375
    20.8200000.2900
    30.3200000.8950
    40.2900000.8275
    50.6033330.2450
    49960.2358330.5050
    49970.2825000.8950
    49980.2841670.8025
    49990.5733330.2525
    50000.1625000.4000

    Normalization rescales the dataset features between 0 and 1 or -1 and 1. Normalization is typically used when the numerical distribution of the features is not known. Normalization fails however on datasets with outliers. Allow me to explain why.

    Suppose you have a small dataset containing the following points.

    XY
    01
    23
    53
    1013
    1517
    2030
    2223
    2419
    99025
    100027

    As you can see from the table, the last two data points are outliers. Using normalization, the data points will transform into the following

    XY
    0.0000.000
    0.0020.083
    0.0050.083
    0.0100.500
    0.0150.666
    0.0200.458
    0.0220.583
    0.0240.708
    0.9900.833
    1.0001.000

    Even though normalization shifts the values between 0 and 1, the outliers still remain as outliers. If we were to compare these points using a distance function, the [] feature would have more influence on the measurement.

    Standardization

    Standardization uses the mean and standard deviation to transform a dataset. The formula for transforming each feature is shown below:

    Here, µ is the mean of the feature and σ is the standard deviation of the feature. This is what the same student score dataset from earlier would look like when standardized.

    StudentSATGPA
    11.393781-0.447397
    21.283359-0.634040
    3-0.6095871.743211
    4-0.7231631.477980
    50.463083-0.810861
    4996-0.9282320.210768
    4997-0.7515571.743211
    4998-0.7452481.379747
    49990.349506-0.781391
    5000-1.205864-0.201813

    Standardization is very helpful in cases where the features in the dataset are normally distributed. But your data doesn’t necessarily have to be normally distributed in order to use standardization. Unlike normalization, standardization does not squeeze your data into a bounding range, meaning that it will still work if there’s outliers present.

    That’s all, folks!

    There are many other data scaling techniques that can be used, but normalization and standardization are the two methods that are mainly used in practice. In my next post I will discuss fuzzy clustering.

  • Weighing your k-means data clustering

    Weighing your k-means data clustering

    In my previous post on the K-Prototypes algorithm, I introduced the concept of parameter weighting. With just a single parameter, you can control the amount of influence categorical attributes has on the data clustering. In this post, I’m going to expand on this concept and show how we can control the weightings of entire data points. There are some data clustering problems you will encounter in the real world where this technique is essential.

    Customer Segmentation Case Study

    Consider the following scenario. You were hired by a growing online distributor of flavored milk to help them understand the shopping habits of their customers. The company would like to segment their customers into different groups. In particular, the company would like to understand the representative spend share for each customer group it can effectively firm up its marketing strategy. You are handed a dataset containing the percentage of sales in five different milk products over the past year.

    CustomerIDOrganic MilkCondensed MilkAlmond MilkCoconut MilkSkim Milk
    113.4526.0511.144.9927.44
    224.7319.5216.583.4923.65
    325.5020.4718.142.6623.49
    44.520.107.369.2713.47
    526.4721.0918.41.9021.75
    69.4315.3711.891.7825.81
    74.8321.115.489.4713.80
    85.7618.097.429.9111.25
    95.3618.274.849.9713.01
    107.9920.176.637.912.39

    Customer segmentation problems such as this one is an excellent use case for K-means clustering. The centroids (i.e. center points) produced by k-means can provide the representative customer characteristics that your client is looking for. Taking a look at the distribution of total spend for each customer, you notice that the numbers are heavily skewed.

    Some customers a high spend share in one or more categories, but their total spend is so low that they are less likely to respond to the company’s marketing campaign. Because K-means weighs each data point evenly, it will not take the customer total spend into account and will produce clusters and center points that are misleading.

    In order to get the output we’re looking for, we need to weigh each customer differently. Customers who have a higher total spend should have more influence on the cluster centroids than customers who do not. This is exactly what weighted k-means clustering allows us to do.

    About Weighted K-means clustering

    To perform weighted k-means clustering, we need to make a minor tweak to the way the cluster centroids are calculated after each iteration. Instead of using the mean the calculate the centroids, we use the weighed mean:

    Each data point xi in the dataset is assigned a weighted value wi. Computing the weighted mean for a cluster is just a matter of:

    1. Multiplying each data point by its weight
    2. Computing the sum the products calculated in step 1 together, and
    3. Dividing the sum calculated in step 2 by the sum of all the weights.

    As usual, here’s the weighted k-means clustering procedure, taking in account the new centroid calculation discussed above:

    1. Choose k. Determine the number of clusters to form
    2. Select initial centroids. In this step the algorithm chooses k data points from the dataset at random to be the initial centroids.
    3. Assign data points to a centroid.  In this step the algorithm calculates the Euclidean distance between each data point and each centroid. Data points are assigned to the centroid with the smallest distance.
    4. Calculate weighted centroids. New centroids are calculated by finding the weighted mean of each cluster
    5. Repeat steps 3 and 4. Repeat steps 3 and 4 until no changes in the assignments are made.

    Weighted K-means and Outliers

    Weighed k-means clustering is also useful when clustering datasets with outliers. By assigning small weight values to outlier data points, k-means will form clusters that are robust to these extreme values.

    That’s all folks!

    I hope you found this guide on weighted k-means useful. In my next post, I will discuss data scaling, an important but often overlooked concept that has a huge impact on the data clustering algorithms.

  • Mixed Data Clustering using K-Prototypes

    Mixed Data Clustering using K-Prototypes

    This blog has thus far covered a number of data clustering algorithms. A majority of the algorithms operate under the assumption that the dataset to be clustered is purely numerical. The k-Modes algorithm was the one method we’ve covered that works on purely categorial data. What if you want to cluster a dataset that contains both categorial and numeric data. That’s exactly what the k-prototypes algorithm allows you to do. 

    What is a Prototype?

    In K-Means, the centroids are found by calculating the mean of each cluster’s data points. In K-Modes, these are derived from the mode. For datasets with mix of categorical and numerical data, the centroids are found by calculating the mean of the numeric data and taking the mode of the categorial data. A prototype is simply a combination of the mean and the mode.

     CityEye ColorAgeScore
     MiamiBlue3293
     MiamiGreen2532
     ChicagoBlack2772
     MiamiBlue2163
    PrototypeMiamiBlue26.2565

    The function that is used to check how similar or different each data point is to a prototype is also a combination of the Euclidean distance from K-Means and the matching function from K-Modes:

    Let’s take this function apart and examine what each piece means:

    The red part of the function above is the k-means Euclidean measurement, and the green part is the k-modes matching dissimilarity measurement.

    The symbols p and q in the function above represent two individual data points.

    N and C represent sets of numeric and categorial variables and n and c represent a variable from both sets. The symbols pn, qn, pc, qc represent an numeric and categorial values.

    Something new you might have also noticed is the gamma term:

    This term controls the amount of weight categorial variables contribute to the output.

    A large enough value for gamma would result in more focus being placed categorial variables. As a result, a point could be assigned to a cluster where the cluster’s prototype is not nearest to it in terms of numeric variables, but the categorial values of most of the other points in the cluster matches that point.

    Here’s an example of the dissimilarity function in action using two data points:

     CityEye ColorAgeScore
    pMiamiBlack3293
    qChicagoBlack2772

    From the function above, N would represent a set containing the numeric variables {Age, Score}, and C is a set containing categorial variables {City, Eye Color}.

    Calculating the Euclidean (red) part of the function, we get the following:

    For the matching (green) part of the function, lets suppose that gamma is set to 1.5. Here’s what we’d get:

    Adding the results from both calculations together, the final result is 23.08.

    Here’s the step by step breakdown of the k-prototypes procedure, taking into account everything covered so far:

    1. Choose k and gamma. Determine the number of clusters to form and the amount of weight categorial variables should have on the clustering.
    2. Select initial prototypes. In this step the algorithm chooses k data points from the dataset at random to be the initial prototypes.
    3. Assign data points to a prototype.  In this step the algorithm calculates the dissimilarity between each data point in the dataset and each prototype and assigns the data point to the prototype it is closest to.
    4. Calculate new prototypes. A new prototype is calculated for each cluster using the dissimilarity function described earlier.
    5. Repeat steps 3 and 4. The algorithm reassigns data points to clusters based on how close they are to the new prototypes. After the reassignment new prototypes are computed. This process is repeated until no changes in the assignments are made.

    Determining Optimal Number of Clusters

    Like all the other methods covered before this one, the Elbow method can be used to determine optimal number of clusters. Here’s the new cost function you will be using for the elbow plot:

    As you might have guessed, this is simply a combination of the sum of squared error from k-means and the total dissimilarity score from k-modes.

    But wait, what about optimal values for gamma?

    With the addition of the gamma term, you might be wondering if there’s a method for finding good values for that parameter. Finding optimal values involves pure experimentation. You perform k-prototypes on your dataset several times with different values and choose a value that weighs the categorical and numeric attributes in the way you expect.

    If you’re looking for a quick and dirty value that provides a decent balance between the categorical and numeric attributes in your dataset, a good heuristic you can use is the standard deviation of the numeric attributes in the dataset.  Z. Huang, in his early experiments with the k-prototypes algorithm, found a suitable value to lie between 1/3 and 2/3 of the standard deviation of the numeric data. Here’s a link to his paper where he describes the experiments more in detail.

    That’s all, folks!

    I hope you enjoyed this tutorial. One thing that was new with this segmentation method was the usage of a gamma paramter to control the influence variables have on the output. In my next post I will cover another clustering algorithm that also makes use of weights. Stay tuned.

  • Mediods Data Clustering with CLARANS

    Mediods Data Clustering with CLARANS

    K-Medoids is a data clustering algorithm that is commonly used on datasets containing extreme values. In previous posts on the subject, I covered two implementations of the algorithm that can be used to find medoids in a dataset.

    The first method, called PAM, scans every data point in the dataset for k medoids. PAM can accurately locate the medoids in the dataset, but at expense of efficiency on large datasets.

    CLARA was the second method I covered. It’s able to achieve superior performance on large datasets by restricting the scan to a smaller randomly sampled portion of the dataset. An issue that may arise with CLARA, however, is that the true medoids of the dataset might get excluded from the sample, causing the method to return sub-optimal results.

    This is where the next algorithm I will discuss comes into play. This algorithm, like CLARA, does not check every data point in the dataset. But unlike CLARA, its scan is not restricted to a small sample of data points, but at random neighborhoods of points selected at multiple iterations. This method is called CLARANS, which is short for  Clustering LARge Applications based on RANdomized Search.

    The Search For the Best Medoids Data Clustering

    The CLARANS clustering algorithm starts out just like any other k clustering method. K data points are chosen at random to be the medoids that all the other data points in the data set will cluster around. From there, CLARANS searches for neighboring medoids that cluster the data better. A better clustering is defined as a set of medoids where the total dissimilarity between each medoid and the data points assigned to it is as small as possible.

    For the sake of making this process CLARANS takes to search for better medoids easier to understand, I want you to imagine that you are standing in a large room littered with dozens of water filled holes.  You and your friends are playing a game where you are tasked with finding the deepest hole in the room. The catch is that this has to be done blindfolded.

    Each hole in the room represents a set of k data points from the dataset chosen at random.

    After your friends place you at one of the holes, you begin your task for looking for the deepest hole in the room. Because you’re unable to determine this visually, you do this by sticking your bare foot into the holes and measuring the depth by how high the water rises up your leg. The depth a hole represents the total cluster dissimilarity for that set of k medoids.

    You make a mental note of  the depth of the hole that’s in front of you and then randomly choose one of the nearby holes to measure next.

    In the case of the CLARANS clustering algorithm, two sets of k medoids are neighbors if they share all but one medoid.

    Neighboring set of k-medoids, where k = 7

    CLARANS selects a neighbor by randomly selecting one of the k medoids from the dataset and replacing it with a data point from the dataset. From there it computes the total cluster dissimilarity of the neighbor and compares it with the dissimilarity of the current set of medoids. 

    CLARANS searches several neighbors, keeping track of the best set of mediods it finds thus far. The maximum number of neighbors that is searched is actually predetermined at the beginning of the algorithm. We’ll call this parameter maxneighbors. Each time CLARANS finds a better set of medoids, it searches maxneighbors more neighbors to see if it finds an even better set. If it fails to find a better clustering after examining additional neighbors, the search is concluded. The set of medoids CLARANS settles on is known as a local minima.

    The following flow chart below summarizes the local search described above:

    CLARANS data clustering local search flow diagram

    From Local Search to Global Search

    Going back to our toy scenario, after searching several neighboring holes, you were able to find the deepest hole within a the small region of holes that you searched. That hole you found, may not be the deepest hole in the room.

    To truly determine this, you decided to repeat your local neighborhood search at other random holes in the room, and compare the deepest hole you found with what you just found. With that your friends proceed to move you to other random holes in the room where you can conduct additional searches for local minimas.

    Translating this process to the CLARANS algorithm, the number of local searches that is performed is another parameter that is determined at the start of the algorithm. We’ll call this number numlocal. After performing several of these local searches, CLARANS has a pretty good approximation of the true medoids in the dataset.  

    The entire CLARANS data clustering process can be summarized with the following steps:

    1. Specify parameters numlocal, k, and maxneighbors.
    2. Let B represent the best k mediods discovered so far.
    3. Let f represent the number of local searches to perform. Set this number to 1.
    4. Let H represent  k randomly selected medoids from the data set.
    5. Let c represent the number of neighbors inspected. Set this number to 1.
    6. Let s represent a random medoid from H. Let j represent a data point from the dataset that is not in H. Let N equal a set of k medoids that share the same points as H, except s is replaced by j.
    7. If the total cluster dissimilarity in N is less than that of H, set H to N and go to step 5.
    8. Increment c by 1. If c < maxneighbors, go back to step 6.
    9. If the total dissimilarity of H is less than that of B, set B to H.
    10. Increment f by 1. If f  < numlocal, go back to step 4.
    11. Output B.

    That’s all, folks!

    In the next post, we will be covering a method that can be used to cluster datasets with a mixture of numeric and categorical data. Stay tuned.

  • Faster Mediods Data Clustering with CLARA

    Faster Mediods Data Clustering with CLARA

    Previously I introduced K-Medoids clustering, a data clustering method that can be used to cluster datasets containing outliers. The Partition Around Medoids (PAM) algorithm I described in my post is computationally slow and expensive and is typically used on small datasets because of that. What if we have a very large dataset that we’d like to cluster using K-Medoids. Do we have no other choice but to sit around a wait for the time consuming clustering method to finish? Not at all! In this post I will introduce you to CLARA, a K-Medoids implementation that can be on large datasets. Let’s get started!

    CLARA Data Clustering .. PAM, But with A Twist

    CLARA is short for Clustering LARge Applications. The PAM Algorithm computes medoids  by finding the data point with the shortest distance from all the other points in its cluster. Computing medoids using PAM is very quick for small datasets. But when it comes to much larger datasets, this could take quite a bit of time. 

    CLARA addresses this issue by performing PAM on a randomly selected sample of data points. We will call the size of this smaller dataset s. The idea behind this approach computing the medoids of this sample would give us an approximation of the medoids if the entire dataset was clustered, assuming that data points from the sample is representative of the entire dataset .

    This clustering of sampled data points is performed not once, but several times. In order to ensure that the best possible clustering is achieved, CLARA draws several random samples of data points, performs PAM on each, and then determines the clustering that produces the best results. We will call the number of random samples and clustering performed n.

    How does CLARA determine which clustering result is the best, you ask?

    That all depends on the dissimilarity (i.e. distance) method that was used to form the clusters. If Euclidean distance was used, than best is defined as the result with the smallest sum of squared error. If Manhattan distance was used, then the total absolute difference is used. See my articles on K-Means, and K-Medians for more information about these measures.

    The medoids from the best clustering is then used to assign the data points from the entire dataset to the appropriate cluster.

    Dataset before CLARA

    Step 1: Sample data points

    Step 2: Compute sample medoids
    Step 3: Assign data points to medoids

    With that, the entire process is outline in the following steps

    1. Select n samples of the dataset, each with the size s.
    2. Perform PAM on each sample
    3. Determine the PAM clustering that produces the most optimal results.
    4. Use the selected PAM clustering to assign each datapoint from the entire dataset to the appropriate centroid.

    With the CLARA implementation, we have an approach for efficiently computing approximate medoids for our dataset. There are some disadvantages to this approach:

    The quality of the results returned by CLARA is heavily dependent on the data sampled. If it turns out that there was some bias in the selection of the data  points, then the clustering results will not be good.

    The effectiveness of CLARA also depends on the sample size. CLARA can’t produce a good clustering if any of the medoids sampled is far from the actual medoids. This will most definitely the case if the sample sizes are too small. On the other hand, sample sizes that are too large would result in longer processing times. Therefore, when using CLARA one must choose a sample size that provides an acceptable tradeoff between speed and accuracy.

    Coming up next

    My next post will cover the CLARANS data clustering algorithm, another K-Medoids implementation. Until next time.

  • Outlier Clustering with K-Medoids

    Outlier Clustering with K-Medoids

    In my last post, I introduced you to the K-Medians algorithm, a data clustering method that can be used to cluster datasets with outliers. In this post I will be introducing you to another k-means variation that is also effective on datasets with outliers. It’s called K-Medoids. As a matter of fact, there are three variations of this clustering method, each with its own advantages and disadvantages. I will be covering just one of these variations in this post. The other two will be covered in subsequent posts.

    Introducing the K-Medoids Algorithm

    The K-Medoids algorithm, much like the other variations we’ve covered thus far, works and functions similarly to k-means. What makes it different is:

    • Unlike the other variations we’ve covered up to this point, any distance function can be used to compute dissimilarities between the data points and the cluster centroids. That means you can use Euclidean distance and Manhattan distance. I will be covering more distance functions that you can use for clustering in a future post.
    • Centroids are computed by finding the medoid of the data points in each cluster. A medoid is a data point from a data set whose average dissimilarity is the smallest. In other words the medoid is the most centrally located point in the dataset. The medoid, like the median, is not influenced by extreme data points. K-Medoids therefore does not suffer from the same limitations as k-means.
    PAM data clustering medoids

    Partitioning Around Medoids

    As I mentioned at the start of this post, there are three variations of k-Medoids clustering. The most common approach, and the one I will be explaining in this post is known as the Partitioning  Around Medoids (PAM) algorithm. The algorithm works as follows:

    1. Initialize. Randomly select k data points as centroids.
    2. Assign. Pick any distance method of your choosing and use it to compute the dissimilarity between each data point and the k centroids. Assign each data point to the closest centroid.
    3. Update. Perform the following for each cluster:
      • Let c represent the cluster centroid, and p represent a non-centroid data point. For each data point:
        1. Swap positions. That is, c becomes p and p becomes c.
        2. Use the distance function you chose in step 2 to compute the average distance between each data point and the centroid.

    After computing the average distance for each datapoint, the new centroid becomes the data point with the smallest average distance.

    Choosing optimal K

    The Elbow method and Sihouette method can be used to find the optimal value of k for K-Medoids. See this post for a description of these methods.

    Up Next…

    The PAM algorithm is able to accurately compute the medoids for each cluster. But this accuracy comes at a cost.  The algorithm is computationally time consuming and this therefore used when the size of the data set is relatively small. For much larger datasets we need to use more efficient methods. I will be discussing one of these methods in my next post. Until next time.