In my last post, we took the 380+ video game attributes that we extracted from the Steam video game dataset and wrote an algorithm to cluster the attributes into 24 groups. In this post we will use the clusters to make a new and improved video game recommender. If you haven’t read my first and second posts on the video game recommender, please read them before continuing.
Step 1: Reconstructing the game features table
The game features table that we built in the first post contained hundreds of attributes. We will construct a much smaller table using the game attribute clusters.
Game Features Table
Each column in the table corresponds to one attribute cluster and indicates how strongly that cluster is represented in each game. The numbers are computed for each game using the attribute-cluster assignments we obtained from K-means. For each game, we count up the number of attributes that belong to each cluster. The construction of the new game features table is described in the code snippet below.
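One way to sketch that counting step, assuming the cluster assignments are available as a dictionary from attribute name to cluster id and each game's attributes as a list. The names `attribute_clusters`, `game_attributes`, and `build_game_features` and the sample values are illustrative, not taken from the original solution.

```python
import pandas as pd

# Hypothetical inputs: attribute -> cluster id (from K-means) and
# game id -> list of attributes. All names and values are illustrative.
attribute_clusters = {"Action": 0, "Shooter": 0, "Puzzle": 1, "Co-op": 2}
game_attributes = {
    "g1": ["Action", "Shooter", "Co-op"],
    "g2": ["Puzzle"],
}
n_clusters = 3


def build_game_features(game_attributes, attribute_clusters, n_clusters):
    """Count, for each game, how many of its attributes fall in each cluster."""
    rows = {}
    for game_id, attributes in game_attributes.items():
        counts = [0] * n_clusters
        for attribute in attributes:
            cluster = attribute_clusters.get(attribute)
            if cluster is not None:  # skip attributes that were never clustered
                counts[cluster] += 1
        rows[game_id] = counts
    columns = ["cluster_{}".format(i) for i in range(n_clusters)]
    return pd.DataFrame.from_dict(rows, orient="index", columns=columns)


game_features = build_game_features(game_attributes, attribute_clusters, n_clusters)
```

With 24 clusters, each game would end up with 24 count columns instead of hundreds of binary ones.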
Using the game features table we built in step 1, we will rebuild a much smaller user features table.
User Features Table
Similar to the game features table, each column in the user features table indicates the degree to which each user prefers games with that attribute cluster. The numbers are computed for each user by retrieving the features of each game the user has played from the game features table and adding them up.
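The summation can be sketched as follows, assuming the game features table is a pandas data frame indexed by game id and the play history is a dictionary from user id to a list of game ids. The function name `build_user_features` and the sample data are hypothetical.

```python
import pandas as pd

# Illustrative game features table (rows are games, columns are clusters).
game_features = pd.DataFrame(
    {"cluster_0": [2, 0, 1], "cluster_1": [0, 1, 1]},
    index=["g1", "g2", "g3"],
)
# Hypothetical play history: user id -> ids of the games they played.
play_history = {"alice": ["g1", "g3"], "bob": ["g2", "missing"]}


def build_user_features(play_history, game_features):
    """Sum the feature rows of every game each user has played."""
    rows = {}
    for user_id, games in play_history.items():
        # Ignore games that have no row in the game features table
        known = [g for g in games if g in game_features.index]
        rows[user_id] = game_features.loc[known].sum()
    # Each dict value becomes a column, so transpose to get one row per user
    return pd.DataFrame(rows).T


user_features = build_user_features(play_history, game_features)
```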
With our new game and user feature tables in place, a new method for examining similarity between users is in order. In our first recommender, we used the matching dissimilarity score. In our new recommender, we’re going to use cosine similarity. Cosine similarity is a measure of similarity between two numerical vectors: the dot product of the two vectors divided by the product of their lengths.
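Written out, cos(u, v) = (u · v) / (‖u‖ ‖v‖). A small NumPy sketch of that formula is shown below. Returning 0 for an all-zero vector is a convention chosen for this example, not something the definition prescribes.

```python
import numpy as np


def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their lengths."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0:
        return 0.0  # assumption: treat an all-zero vector as similar to nothing
    return float(np.dot(a, b) / denom)


same_direction = cosine_similarity([1, 2], [2, 4])   # vectors point the same way
perpendicular = cosine_similarity([1, 0], [0, 1])    # vectors share nothing
```

Because only the direction of the vectors matters, a user who played 200 games and a user who played 20 can still come out as very similar if their preferences are spread across the clusters in the same proportions.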
Let’s suppose that we have a user that we want to generate recommendations for. We’ll call the user in question u and the number of recommendations we’d like to generate x. We will use cosine similarity to find another user, v, whose preferences are the most similar to u’s. We will then select x games from v’s play history that have not been played by u and recommend them.
Here’s the new recommendation procedure in code form.
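A minimal sketch of that procedure, assuming a `user_features` data frame with one row per user and a `play_history` dictionary from user id to game ids. The function signature and the sample data are illustrative and may differ from the original solution.

```python
import numpy as np
import pandas as pd


def recommend_games(u, x, user_features, play_history):
    """Recommend up to x games played by the user most similar to u."""
    target = user_features.loc[u].to_numpy(dtype=float)
    best_user, best_score = None, -np.inf
    for v, row in user_features.iterrows():
        if v == u:
            continue  # a user is trivially most similar to themselves
        other = row.to_numpy(dtype=float)
        denom = np.linalg.norm(target) * np.linalg.norm(other)
        score = float(target @ other / denom) if denom else 0.0
        if score > best_score:
            best_user, best_score = v, score
    if best_user is None:
        return []
    played = set(play_history[u])
    # Keep only the games from v's history that u has not played
    unseen = [g for g in play_history[best_user] if g not in played]
    return unseen[:x]


# Illustrative data: carol's preferences point in the same direction as alice's.
user_features = pd.DataFrame(
    {"cluster_0": [3, 0, 6], "cluster_1": [1, 1, 2]},
    index=["alice", "bob", "carol"],
)
play_history = {"alice": ["g1", "g3"], "bob": ["g2"], "carol": ["g1", "g4", "g5"]}
recommended = recommend_games("alice", 1, user_features, play_history)
```

This scans every other user for each request, which is fine for a sketch but would likely need a faster nearest-neighbor lookup at the scale of the full dataset.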
In a previous blog post, I walked through the creation of a simple recommender system that recommends video games to existing users on Steam. Since creating the recommender, my student and I have been exploring ways to improve it. One enhancement we’ve been looking at is reducing the recommender’s computation time by clustering video game attributes (tags, genres, and specs) into smaller, more manageable groups. In this post I will describe how we used the K-means algorithm to do this.
Improving the Recommender by Computing Probabilities
The current system uses 188 categorical attributes to recommend video games to existing users. The biggest disadvantage of this approach is the large amount of computation time required to build the tables the recommender requires. The numerous categorical features also make adding numerical features (such as game price) to the system challenging; the influence of the categorical features will outweigh any impact the numerical features could have on the recommendations. To correct both issues, we will attempt to reduce the number of attributes the recommender uses by applying K-means clustering.
The raw, unprocessed video game metadata contains 380+ attributes. We observed from looking at the data for several video games that some attributes tend to appear together with other attributes. This gave me the idea to try to group these attributes based on the likelihood that they will appear together. We will do this by constructing a square matrix that contains, for every pair of attributes, the conditional probability of observing one attribute in a video game given that the other is present. We use this matrix to cluster the game attributes.
The code snippet below shows the construction of the probability matrix. We iterate through each game in the Steam games dataset and construct a dictionary containing the sets of games that have each attribute.
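A sketch of that dictionary-building pass, assuming each game is a dictionary whose `genres`, `tags`, and `specs` fields hold lists of attribute names (or are missing). The sample records and the name `build_attribute_index` are hypothetical.

```python
from collections import defaultdict

# Illustrative game records; a field may be empty or None.
games = [
    {"id": "g1", "genres": ["Action"], "tags": ["Shooter", "Action"], "specs": ["Co-op"]},
    {"id": "g2", "genres": ["Action"], "tags": [], "specs": None},
]


def build_attribute_index(games):
    """Map each attribute to the set of ids of the games that have it."""
    attribute_games = defaultdict(set)
    for game in games:
        for field in ("genres", "tags", "specs"):
            for attribute in game.get(field) or []:
                attribute_games[attribute].add(game["id"])
    return attribute_games


attribute_games = build_attribute_index(games)
```

Using sets means an attribute that shows up both as a genre and as a tag of the same game is only counted once for that game.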
We use the attributes dictionary to build the probability matrix.
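One plausible way to fill the matrix is to estimate P(b | a) as the share of games with attribute a that also have attribute b. The direction of the conditioning and the names below are assumptions made for this sketch.

```python
import numpy as np


def build_probability_matrix(attribute_games):
    """Entry [i, j] estimates P(attribute j | attribute i) from co-occurrence counts."""
    attributes = sorted(attribute_games)
    n = len(attributes)
    matrix = np.zeros((n, n))
    for i, a in enumerate(attributes):
        games_a = attribute_games[a]
        if not games_a:
            continue  # no games have this attribute, so leave the row at zero
        for j, b in enumerate(attributes):
            shared = games_a & attribute_games[b]
            matrix[i, j] = len(shared) / len(games_a)
    return attributes, matrix


# Illustrative attribute -> game-id sets.
demo_index = {"Action": {"g1", "g2"}, "Shooter": {"g1"}, "Puzzle": {"g3"}}
attributes, probability_matrix = build_probability_matrix(demo_index)
```

Each row then describes one attribute by how often every other attribute accompanies it, which is the representation that gets clustered.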
The probability matrix computed in the last section has 381 dimensions. As discussed in my high dimensional clustering post, clustering data with very high dimensions could be problematic. To avoid these problems, we’re gonna apply dimensionality reduction.
The code snippet below uses the PCA class provided by scikit-learn to perform Principal Component Analysis on the probability matrix. To determine the number of principal components to keep, we computed a cumulative sum of the explained variance ratios for each principal component. Check out my article on PCA for more details on how it works.
from sklearn.decomposition import PCA

# Fit PCA with all components and print the cumulative explained variance
pca = PCA()
pca.fit(probability_matrix)
total = 0
for idx, r in enumerate(pca.explained_variance_ratio_):
    total += r
    # idx is zero-based, so idx + 1 is the number of components included so far
    print("{0} Components: {1}".format(idx + 1, total))
We decided to keep just enough components to explain 70% of the variability in the data. That number happened to be 31. We will refit PCA to the dataset and reduce the dimensions.
# Refit PCA to the probability matrix and keep only the 31 principal components
pca = PCA(n_components=31)
pca.fit(probability_matrix)
feature_set = pca.transform(probability_matrix)
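The cutoff can also be computed rather than read off the printed output. The helper below is a sketch that works on any sequence of explained variance ratios; the name `components_for_variance` and the sample ratios are made up for illustration.

```python
import numpy as np


def components_for_variance(explained_variance_ratio, target=0.70):
    """Smallest number of leading components whose cumulative ratio reaches target."""
    cumulative = np.cumsum(explained_variance_ratio)
    # Index of the first cumulative value >= target, converted to a 1-based count
    count = int(np.searchsorted(cumulative, target) + 1)
    return min(count, len(cumulative))


demo_ratios = [0.5, 0.15, 0.07, 0.05]
n_keep = components_for_variance(demo_ratios)
```

With a fitted model, this would be called as `components_for_variance(pca.explained_variance_ratio_)`. scikit-learn's `PCA` also accepts a float between 0 and 1 for `n_components`, which selects the number of components by explained variance directly.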
With the newly featurized dataset in place, we can now proceed with the data clustering.
Discovering Game Attribute Categories
We use the silhouette method to find the optimal number of clusters for K-means. The code snippet below uses the scikit-learn implementation of K-means and the silhouette score to derive scores for different numbers of clusters. Check out this post for details on how the silhouette method works.
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Score every candidate cluster count from 2 through 27
range_n_clusters = range(2, 28)
for n_clusters in range_n_clusters:
    clusterer = KMeans(n_clusters=n_clusters, random_state=25)
    cluster_labels = clusterer.fit_predict(feature_set)
    silhouette_avg = silhouette_score(feature_set, cluster_labels)
    print(
        "For n_clusters =",
        n_clusters,
        "The average silhouette_score is :",
        silhouette_avg,
    )
Using the snippet above we determined the optimal number of clusters to be 24. Last, but not least, we group the game attributes into 24 clusters and output the group assignments.
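Turning the cluster labels into readable groups can be sketched as below. It assumes `attributes` is the list of attribute names in the same order as the rows of the feature set and `cluster_labels` is the array K-means returns for them; the helper name and sample values are illustrative.

```python
from collections import defaultdict


def group_attributes(attributes, cluster_labels):
    """Collect attribute names under the cluster id K-means assigned to each."""
    groups = defaultdict(list)
    for attribute, label in zip(attributes, cluster_labels):
        groups[int(label)].append(attribute)
    return dict(groups)


# Illustrative labels; in practice they would come from something like
# KMeans(n_clusters=24, random_state=25).fit_predict(feature_set)
demo_attributes = ["Action", "Shooter", "Puzzle", "Co-op"]
demo_labels = [0, 0, 1, 2]
groups = group_attributes(demo_attributes, demo_labels)

for cluster_id in sorted(groups):
    print(cluster_id, groups[cluster_id])
```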
K-means does a pretty good job of grouping attributes into meaningful clusters. One downside you might notice is that the cluster assignments will not be consistent when the snippet is run multiple times. Because the initial centroids used by K-means are chosen at random, we can get different cluster outputs on the same dataset. We can work around this issue by performing the clustering several times and choosing the result that has the lowest sum of squared errors.
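That workaround might look like the following sketch, which relies on the `inertia_` attribute scikit-learn exposes for the within-cluster sum of squared distances. The helper name, the number of runs, and the synthetic data are assumptions made for the example.

```python
import numpy as np
from sklearn.cluster import KMeans


def best_of_n_kmeans(features, n_clusters, n_runs=10):
    """Run K-means with several seeds and keep the fit with the lowest inertia."""
    best_model = None
    for seed in range(n_runs):
        # n_init=1 so that each loop iteration is exactly one random start
        model = KMeans(n_clusters=n_clusters, n_init=1, random_state=seed)
        model.fit(features)
        if best_model is None or model.inertia_ < best_model.inertia_:
            best_model = model
    return best_model


# Synthetic demo data: three well-separated blobs in two dimensions.
rng = np.random.default_rng(0)
demo_features = np.vstack([rng.normal(c, 0.1, size=(20, 2)) for c in (0, 5, 10)])
best = best_of_n_kmeans(demo_features, n_clusters=3)
```

`KMeans` can do the same thing internally through its `n_init` parameter, and fixing `random_state` makes the chosen result repeatable between runs.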
That’s all folks!
In my next post, we will use the attribute clusters to make a new version of the recommender. We will then explore methods for evaluating how good of a job our recommenders do with recommending video games to users. You can find the code for the entire solution here.
As a code coach at theCoderSchool, I teach and guide young students in the development of software applications. Some apps are simple calculator apps, mad libs generators, and implementations of popular games like Tic-Tac-Toe and Connect Four. Other apps are as complex as web scrapers, networked multi-player space shooters, and Rubik’s cube solvers. Recently I’ve been working with one of my advanced students on a video game recommender system. I immediately had the thought to document our process in a series of blog posts.
A recommender system helps users discover new products and services that users would otherwise not discover on their own. Companies like Amazon and Netflix use recommender systems to suggest new products and movies for their users to buy and watch. Recommender systems work by examining how similar items are to the ones used by users in the past.
In a series of blog posts I will guide you through the development of different recommender systems using the Steam Video Game Dataset. By the end of this post, you will know how to create a very basic content-based recommender system that recommends video games to existing users.
This post assumes that you have a solid understanding of Python and the open source pandas module. If your knowledge of Python and pandas is shaky, make sure to brush up on them before proceeding.
Checking out the Steam Video Game Dataset
The Steam Video Game Dataset provides several JSON files that contain information about reviews on the Steam platform, user and item metadata, and item bundles. For the recommender system we will be building in this post, we will be using the User and Item Data and Item metadata JSON files.
The User and Item Data file contains information collected from over 5 million Steam users. Pictured below is the JSON object structure for a single user.
The items element contains a list of video games played by a single user and the amount of time the user spent playing each game.
The items_count element provides the total number of games played by the user. The steam_id, user_id, and user_url are unique identifiers for the user within the Steam platform.
The Item metadata file provides data on 32,000 games on Steam. Below is the JSON object for one item in the dataset.
Now that you’re acquainted with the data that we will be using, I’ll provide an overview of how the recommender will be built.
How the Recommender Will Work
We will develop a simple algorithm that finds video games with characteristics similar to those of games the user has played in the past. The JSON data files covered in the last section will be used to build two tables.
The first table, which we will call the game_features table, contains the attributes of each game in the dataset.
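| game id  | genre 1 | … | genre n | tag 1 | … | tag n | spec 1 | … | spec n |
|----------|---------|---|---------|-------|---|-------|--------|---|--------|
| xxxxxxxx | 1       | … | 0       | 1     | … | 0     | 0      | … | 0      |
| xxxxxxxx | 0       | … | 0       | 0     | … | 1     | 1      | … | 1      |
| xxxxxxxx | 1       | … | 0       | 0     | … | 0     | 1      | … | 0      |
| xxxxxxxx | 0       | … | 1       | 1     | … | 1     | 0      | … | 1      |
| xxxxxxxx | 1       | … | 0       | 0     | … | 0     | 0      | … | 0      |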
We will use the genres, tags, and specs fields from the item metadata file to create a set of binary features for each game. Games that have the particular attribute will have the value 1, otherwise the value will be 0.
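As a small illustration of that encoding, the sketch below turns a set of attributes per game into 0/1 columns. The column names and sample games are hypothetical.

```python
import pandas as pd

# Illustrative input: game id -> set of attribute names present in that game.
games = {"g1": {"genre_Action", "spec_Co-op"}, "g2": {"tag_Puzzle"}}
columns = ["genre_Action", "tag_Puzzle", "spec_Co-op"]


def build_binary_features(games, columns):
    """One row per game; 1 if the game has the attribute, otherwise 0."""
    rows = {
        game_id: [1 if column in attributes else 0 for column in columns]
        for game_id, attributes in games.items()
    }
    return pd.DataFrame.from_dict(rows, orient="index", columns=columns)


binary_game_features = build_binary_features(games, columns)
```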
The second table will contain the preferred video game characteristics for each user. We’ll call this table the user_features table.
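| user id  | genre 1 | … | genre n | tag 1 | … | tag n | spec 1 | … | spec n |
|----------|---------|---|---------|-------|---|-------|--------|---|--------|
| xxxxxxxx | 0       | … | 1       | 1     | … | 1     | 1      | … | 0      |
| xxxxxxxx | 1       | … | 0       | 0     | … | 0     | 1      | … | 0      |
| xxxxxxxx | 1       | … | 0       | 0     | … | 0     | 0      | … | 0      |
| xxxxxxxx | 0       | … | 0       | 0     | … | 1     | 0      | … | 1      |
| xxxxxxxx | 1       | … | 0       | 0     | … | 0     | 0      | … | 1      |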
This table has the same structure as the game_features table. Each binary feature indicates whether or not the user has played a game that has that particular attribute.
Given an existing user, the algorithm will recommend new games by performing the following actions:
1. Filter out all the games from the game_features table that were already played by the user.
2. Compute a dissimilarity score between each remaining game in the game_features table and the user’s preferred game characteristics. We will be using the dissimilarity scoring method covered in my post on the K-modes algorithm.
3. Return the top 10 games with the lowest dissimilarity score.
Now that you know how the algorithm will work, let’s start coding!
Loading the Steam Games Dataset
First things first. We will read the item data file, parse the JSON objects, and create a pandas data frame containing the fields from each object. Because the item metadata file contains improperly formatted JSON, the pandas read_json() function cannot be used to create the data frame. We will need to iterate through each JSON object independently and parse it using the Python ast module.
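A sketch of that parsing loop, assuming each line of the file holds one record written as a Python dictionary literal (for example with single-quoted keys), which `ast.literal_eval` can read even though a JSON parser cannot. The sample lines and the helper name are illustrative.

```python
import ast

import pandas as pd


def parse_literal_lines(lines):
    """Parse one Python-literal record per line into a data frame."""
    records = []
    for line in lines:
        line = line.strip()
        if line:  # skip blank lines
            records.append(ast.literal_eval(line))
    return pd.DataFrame(records)


# Illustrative records; with a real file you would pass an open file handle instead.
sample_lines = [
    "{'id': '10', 'app_name': 'Game A', 'genres': ['Action']}",
    "{'id': '20', 'app_name': 'Game B'}",
]
sample_df = parse_literal_lines(sample_lines)
```

`ast.literal_eval` only evaluates literals such as strings, numbers, lists, and dictionaries, so it is a safer choice than `eval` for data from an external file.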
Here’s what some of the data from the resulting data frame looks like:
item metadata dataframe
Building the Steam Game Features Table
With the item metadata loaded, we can now build the game_features table. But before we do that, let’s examine the values of genres, tags, and specs attributes. The code snippet below creates a pandas series for each attribute.
import pandas as pd

# Collect every genre, tag, and spec value across all games
genres = []
tags = []
specs = []
for idx in range(steam_games_df.shape[0]):
    game_genre = steam_games_df.iloc[idx]['genres']
    game_tags = steam_games_df.iloc[idx]['tags']
    game_specs = steam_games_df.iloc[idx]['specs']
    # Each field is a comma-separated string; skip it when it is empty
    if game_genre:
        genres.extend(game_genre.split(","))
    if game_tags:
        tags.extend(game_tags.split(","))
    if game_specs:
        specs.extend(game_specs.split(","))
genres_srs = pd.Series(genres)
tags_srs = pd.Series(tags)
specs_srs = pd.Series(specs)
Using the pandas series unique() method, we can obtain the unique values for each attribute.
As you can see, the attributes have an extremely high cardinality. To address this, we will group infrequently occurring values for each attribute into an ‘other’ category. In future posts, we will explore other methods for dealing with categorical data with high cardinality. The code snippet below identifies the groupings for each attribute and then creates column names that will be used when we build the game_features table.
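One way to sketch the grouping is to keep values that occur at least a minimum number of times and fold everything else into a single ‘other’ column. The threshold, the `prefix_value` naming scheme, and the function names are assumptions for illustration.

```python
import pandas as pd


def frequent_values(values, min_count):
    """Values that occur at least min_count times."""
    counts = pd.Series(values).value_counts()
    return set(counts[counts >= min_count].index)


def to_column_names(values, prefix, min_count):
    """Column names for the frequent values plus one 'other' bucket if needed."""
    keep = frequent_values(values, min_count)
    names = sorted("{}_{}".format(prefix, value) for value in keep)
    if len(keep) < len(set(values)):
        names.append("{}_other".format(prefix))  # bucket for infrequent values
    return names


demo_tags = ["Action", "Action", "Indie", "Indie", "Indie", "Rare"]
tag_columns = to_column_names(demo_tags, "tag", min_count=2)
```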
We will now use the game_features table built in the last section to create the user_features table. We will load the user items JSON file and retrieve the list of games played by each user. Each game will be cross-referenced with the game_features table to determine the user’s preferred game characteristics. We will also store the user’s play history for later use when it comes time to recommend new games to the user. Because the user items JSON file contains data for over 5 million users, the code snippet below will take a considerable amount of time to execute. To speed things up, you can reduce the number of records that are parsed.
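The cross-referencing step can be sketched as follows, assuming each parsed user record has a `user_id` and an `items` list, and that each entry in `items` carries the game id under an `item_id` key (that key name is an assumption here). A user's feature is 1 if any game they played has that attribute.

```python
import pandas as pd


def build_user_tables(users, game_features):
    """Return (user_features, play_history) from parsed user records."""
    play_history = {}
    rows = {}
    for user in users:
        played = [item["item_id"] for item in user["items"]]
        known = [g for g in played if g in game_features.index]
        play_history[user["user_id"]] = played
        if known:
            # 1 if at least one played game has the attribute
            rows[user["user_id"]] = game_features.loc[known].max()
        else:
            rows[user["user_id"]] = pd.Series(0, index=game_features.columns)
    user_features = pd.DataFrame(rows).T.astype(int)
    return user_features, play_history


# Illustrative data.
demo_game_features = pd.DataFrame(
    {"genre_Action": [1, 0], "tag_Puzzle": [0, 0], "spec_Co-op": [0, 1]},
    index=["g1", "g2"],
)
demo_users = [
    {"user_id": "alice", "items": [{"item_id": "g1"}, {"item_id": "g2"}]},
    {"user_id": "bob", "items": []},
]
demo_user_features, demo_history = build_user_tables(demo_users, demo_game_features)
```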
With the game_features and user_features tables now in place, we can code the recommender algorithm.
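A minimal sketch of the two pieces, assuming the matching dissimilarity is the number of binary features on which a game and a user differ. The function bodies and signatures here are illustrative and may not match the original solution.

```python
import numpy as np
import pandas as pd


def dissimilarity_score(game_vector, user_vector):
    """Matching dissimilarity: count of positions where the two vectors differ."""
    return int(np.sum(np.asarray(game_vector) != np.asarray(user_vector)))


def recommend_games(user_id, game_features, user_features, play_history, n=10):
    """Ids of the n unplayed games with the lowest dissimilarity to the user."""
    # Step 1: drop games the user has already played
    candidates = game_features.drop(index=play_history[user_id], errors="ignore")
    # Step 2: score every remaining game against the user's preferences
    user_vector = user_features.loc[user_id, game_features.columns].to_numpy()
    scores = {
        game_id: dissimilarity_score(row.to_numpy(), user_vector)
        for game_id, row in candidates.iterrows()
    }
    # Step 3: lowest scores first; ties keep their original table order
    return sorted(scores, key=scores.get)[:n]


# Illustrative data.
demo_games = pd.DataFrame(
    [[1, 0, 1], [0, 1, 0], [1, 0, 0], [1, 1, 1]],
    index=["g1", "g2", "g3", "g4"],
    columns=["genre_Action", "tag_Puzzle", "spec_Co-op"],
)
demo_users = pd.DataFrame([[1, 0, 1]], index=["alice"], columns=demo_games.columns)
demo_history = {"alice": ["g1"]}
top_two = recommend_games("alice", demo_games, demo_users, demo_history, n=2)
```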
The code snippet above defines two functions. The dissimilarity_score function is the scoring function that will determine how similar a game is to a user’s game characteristics. The recommend_games function uses the dissimilarity_score function to provide the ids of games that are similar to a user’s preferences.
Here’s what we get when recommending new games for the Steam user ‘evcentric’:
We can use the following code snippet to get the names of the titles:
for game in recommended_games:
    filtered_games = steam_games_df[steam_games_df.id == game]
    game_name = filtered_games.iloc[0]['app_name']
    print(game_name)
We get the following as output:
CounterAttack
Call to Arms
BrainBread 2
Castle Crashers®
BattleBlock Theater®
Streets of Rogue
Niffelheim
Unturned - Permanent Gold Upgrade
ARM PLANETARY PROSPECTORS Asteroid Resource Mining
Bloody Trapland
That’s all, Folks!
We have created a basic video game recommender. There are definitely more enhancements we can make to the recommender in order to get better results. That is what we will be covering in the next series of posts. You can find the code for the entire solution here.