Tag: recommender systems

  • Improving the Video Game Recommender

    Improving the Video Game Recommender

    In my last post, we took the 380+ video game attributes that we extracted from the Steam video game dataset and wrote an algorithm to cluster them into 24 groups. In this post we will use the clusters to make a new and improved video game recommender. If you haven’t read my first
    and second posts on the video game recommender, please read them before continuing.

    Step 1: Reconstructing the game features table

    The game features table that we built in the first post contained hundreds of attributes. We will construct a much smaller table using the game attribute clusters.

    Game Features Table

    Each column in the table indicates how strongly that attribute cluster is present in each game. The numbers are computed for each game using the attribute-cluster assignments we obtained from K-means. For each game, we count up the number of attributes that belong to each cluster. The construction of the new game features table is described in the code snippet below.

    import pandas as pd
    from sklearn.cluster import KMeans
    
    # Fit the K-means clustering algorithm and get the cluster assignment for each attribute
    km = KMeans(n_clusters=24, random_state=25)
    km.fit(feature_set)
    labels = km.predict(feature_set)
    attribute_assignments = pd.Series(labels, index=game_categories)
    # Group attributes into a list of clusters
    attribute_clusters = []
    for i in range(24):
        cluster = attribute_assignments[attribute_assignments == i]
        attribute_clusters.append(cluster.index.tolist())
    # clust_0 to clust_23 are the K-means clusters; clust_24 counts attributes that belong to no cluster
    feature_columns = ['clust_'+str(i) for i in range(25)]
    game_features = []
    for idx in range(steam_games_df.shape[0]):
        # Obtain list of genres, tags, and specs
        game_genre = steam_games_df.iloc[idx]['genres']
        game_tags = steam_games_df.iloc[idx]['tags']
        game_specs = steam_games_df.iloc[idx]['specs']
        attributes = []
        data_row = {k:0 for k in feature_columns}
        data_row['id'] = steam_games_df.iloc[idx]['id']
        # Iterate through each entry in the lists and create the features
        if game_genre:
            attributes.extend(game_genre.split(','))
        if game_tags:
            attributes.extend(game_tags.split(','))
        if game_specs:
            attributes.extend(game_specs.split(','))
        attributes = set(attributes)
        for attr in attributes:
            for i in range(len(attribute_clusters)):
                if attr in attribute_clusters[i]:
                    data_row['clust_'+str(i)] += 1
                    break
            else:
                # No cluster contains this attribute
                data_row['clust_24'] += 1
        game_features.append(data_row)
    game_features_df = pd.DataFrame(game_features)
    game_features_df = game_features_df.set_index('id')
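
    To make the counting step concrete, here is a minimal sketch on made-up data. The two toy clusters and the helper name count_cluster_features are illustrative assumptions and are not taken from the real K-means output. It uses a dictionary lookup instead of scanning every cluster, which is one possible way to do the same count.

```python
# Toy attribute clusters (illustrative only, not the real K-means output)
attribute_clusters = [
    ["FPS", "Shooter", "Zombies"],      # cluster 0
    ["Puzzle", "Casual", "Match 3"],    # cluster 1
]
OTHER = len(attribute_clusters)          # index of the catch-all column

# Build an attribute -> cluster index lookup once
attr_to_cluster = {
    attr: i for i, members in enumerate(attribute_clusters) for attr in members
}

def count_cluster_features(genres, tags, specs):
    """Count how many of a game's distinct attributes fall in each cluster."""
    attributes = set()
    for field in (genres, tags, specs):
        if field:                        # a field may be None
            attributes.update(field.split(","))
    row = {"clust_" + str(i): 0 for i in range(OTHER + 1)}
    for attr in attributes:
        # Attributes missing from the lookup go to the catch-all column
        row["clust_" + str(attr_to_cluster.get(attr, OTHER))] += 1
    return row

row = count_cluster_features("Casual", "FPS,Shooter,Casual", "Single-player")
print(row)
```

    Because the attributes are collected into a set first, "Casual" is counted once even though it appears as both a genre and a tag.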

    Step 2: Reconstructing the user features table

    Using the game features table we built in step 1, we will rebuild a much smaller user features table.

    User Features Table

    Similar to the game features table, each column in the user features table indicates how strongly each user prefers games with that attribute cluster. The numbers are computed for each user by retrieving the features of each game the user has played from the game features table and adding them up.

    import ast
    import pandas as pd
    
    game_feat_dict = game_features_df.to_dict()
    # Read user items data file and build features table
    with open('/content/drive/MyDrive/VideoGameRecFiles/australian_users_items.json','r',encoding='utf8') as f:
        data = f.read()
    data = data.strip().split("\n")
    user_features = []
    # We need to keep track of all the games each user played so we can avoid recommending games that they have already played.
    user_play_list = {}
    for user_data in data:
        # The dataset is not a properly formatted JSON file. Because of this we need to iterate through each individual JSON object and use
        # the ast module to parse the object.
        record = ast.literal_eval(user_data)
        data_row = {k:0 for k in feature_columns}
        data_row['user_id'] = record['user_id']
        play_list = []
        for item in record['items']:
            item_id = item['item_id']
            play_list.append(item_id)
            # Add this game's cluster counts to the user's totals (games missing from the table are skipped)
            for col in feature_columns:
                if item_id in game_feat_dict[col]:
                    data_row[col] += game_feat_dict[col][item_id]
        user_play_list[record['user_id']] = play_list
        user_features.append(data_row)
    user_features_df = pd.DataFrame(user_features)
    user_features_df = user_features_df.set_index("user_id").drop_duplicates()
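
    The aggregation itself is just a column-wise sum over the games a user has played. The sketch below shows it on a made-up three-game table; the game ids, counts, and the helper name build_user_features are assumptions for illustration only.

```python
from collections import Counter

# Toy game features table: game id -> cluster counts (made-up values)
game_features = {
    "10": {"clust_0": 2, "clust_1": 0},
    "20": {"clust_0": 1, "clust_1": 3},
    "30": {"clust_0": 0, "clust_1": 1},
}

def build_user_features(played_ids):
    """Sum the cluster counts of every known game in a user's play list."""
    totals = Counter({"clust_0": 0, "clust_1": 0})
    for game_id in played_ids:
        # Games that are missing from the features table contribute nothing
        totals.update(game_features.get(game_id, {}))
    return dict(totals)

user_row = build_user_features(["10", "20", "999"])
print(user_row)
```

    Game "999" is not in the toy table, so it is skipped, which mirrors the membership check in the post's loop.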

    Step 3: Defining a new recommender function

    With our new game and user feature tables in place, a new method for measuring similarity between users is in order. In our first recommender, we used the matching dissimilarity score. In our new recommender, we’re going to use cosine similarity. Cosine similarity is a measure of similarity between two numerical vectors: the dot product of the two vectors divided by the product of their lengths.
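
    As a quick illustration of that definition, here is a small pure-Python version of the calculation. The function name cosine_similarity_1d and the toy vectors are assumptions made for this sketch; returning 0.0 for an all-zero vector is a convention chosen here, not something the definition dictates.

```python
import math

def cosine_similarity_1d(a, b):
    """Dot product of a and b divided by the product of their Euclidean lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_a * norm_b)

u = [3, 0, 1]
v = [6, 0, 2]   # same direction as u, twice the magnitude
w = [0, 5, 0]   # shares no cluster with u

sim_uv = cosine_similarity_1d(u, v)   # close to 1: identical preferences
sim_uw = cosine_similarity_1d(u, w)   # 0: nothing in common
```

    Because only the direction of the vectors matters, a user who has played many games can still match a user with the same tastes who has played only a few.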

    Let’s suppose that we have a user that we want to generate recommendations for. We’ll call the user in question u and the number of recommendations we’d like to generate x. We will use cosine similarity to find another user, v, whose preferences are the most similar to u’s. We will then select x games from v’s play history that have not been played by u and recommend them. If v does not have enough unplayed games, we move on to the next most similar user.

    Here’s the new recommendation procedure in code form.

    from sklearn.metrics.pairwise import cosine_similarity
    
    def cosine_score(user1, user2):
        score = cosine_similarity(user1.values.reshape(1, -1), user2.values.reshape(1,-1))[0][0]
        return score
    
    def recommend_games(user_id, n=10):
        '''
        Given a user id, recommend games to that user. By default 10 games are recommended
        '''
        # Get user features
        user = user_features_df.loc[user_id]
        # Get games played by the user
        play_list = user_play_list[user_id]
        # Rank all other users from most to least similar
        other_users = user_features_df[user_features_df.index != user_id]
        scores = other_users.apply(lambda user2: cosine_score(user, user2), axis=1).sort_values(ascending=False)
        rec_idx = 0
        recommended_games = user_play_list[scores.index[rec_idx]]
        recommended_games = list(filter((lambda gid: gid not in play_list), recommended_games))
        # Keep pulling games from the next most similar user until we have n games or run out of users
        while len(recommended_games) < n and rec_idx + 1 < len(scores):
            rec_idx += 1
            additional_games = user_play_list[scores.index[rec_idx]]
            recommended_games.extend(list(filter((lambda gid: gid not in play_list), additional_games)))
        return recommended_games[:n]
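
    To see the procedure end to end without the full dataset, here is a self-contained sketch on three made-up users. The user ids, vectors, play lists, and the function names cosine and recommend are all assumptions for illustration. This sketch also skips games it has already picked, so the same game cannot be recommended twice.

```python
import math

# Toy user feature vectors and play histories (made-up data)
user_vectors = {
    "u": [4, 0, 1],
    "a": [8, 1, 2],
    "b": [0, 5, 0],
}
play_lists = {
    "u": ["g1", "g2"],
    "a": ["g2", "g3"],
    "b": ["g4", "g5", "g6"],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms if norms else 0.0

def recommend(user_id, n=3):
    """Walk other users from most to least similar, collecting unplayed games."""
    seen = set(play_lists[user_id])
    ranked = sorted(
        (other for other in user_vectors if other != user_id),
        key=lambda other: cosine(user_vectors[user_id], user_vectors[other]),
        reverse=True,
    )
    picks = []
    for other in ranked:              # ends on its own when users run out
        for game in play_lists[other]:
            if game not in seen:
                picks.append(game)
                seen.add(game)        # prevents duplicate recommendations
        if len(picks) >= n:
            break
    return picks[:n]

recs = recommend("u", n=3)
print(recs)
```

    User "a" is the closest match to "u" but contributes only one unplayed game, so the remaining slots are filled from user "b".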

    That’s all folks!

    You can find the code for this post here. Until next time!

  • Clustering Video Game Attributes

    Clustering Video Game Attributes

    In a previous blog post, I walked through the creation of a simple recommender system that recommends video games to existing users on Steam. Since creating the recommender, my student and I have been exploring ways to improve it. One enhancement we’ve been looking at is reducing the recommender’s computation time by clustering video game attributes (tags, genres, and specs) into smaller, more manageable groups. In this post I will describe how we used the K-means algorithm to do this.

    Improving the Recommender by Computing Probabilities

    The current system uses 188 categorical attributes to recommend video games to existing users. The biggest disadvantage of this approach is the large amount of computation time required to build the tables the recommender requires. The numerous categorical features also make adding numerical features (such as game price) to the system challenging; the influence of the categorical features will outweigh any impact the numerical features could have on the recommendations. To correct both issues, we will attempt to reduce the number of attributes the recommender uses with K-means clustering.

    The raw, unprocessed video game metadata contains 380+ attributes. We observed from looking at the data for several video games that some attributes tend to appear together with other attributes. This gave me the idea to try to group these attributes based on the likelihood that they appear together. We will do this by constructing a square matrix in which each entry is the conditional probability of observing one attribute in a video game given that another attribute is present. We then use this matrix to cluster the game attributes.

    The code snippet below shows the first step in constructing the probability matrix. We iterate through each game in the Steam games dataset and construct a dictionary containing the set of games that have each attribute.

    # Create a dictionary containing sets of games that have each attribute
    category_sets = {}
    for idx in range(steam_games_df.shape[0]):
        game_genre = steam_games_df.iloc[idx]['genres']
        game_tags = steam_games_df.iloc[idx]['tags']
        game_specs = steam_games_df.iloc[idx]['specs']
        game_id = steam_games_df.iloc[idx]['id']
        if game_genre:
            cat_genres = game_genre.split(",")
            for g in cat_genres:
                if g in category_sets:
                    category_sets[g].add(game_id)
                else:
                    category_sets[g] = set([game_id])
        if game_tags:
            cat_tags = game_tags.split(",")
            for t in cat_tags:
                if t in category_sets:
                    category_sets[t].add(game_id)
                else:
                    category_sets[t] = set([game_id])
        if game_specs:
            cat_specs = game_specs.split(",")
            for s in cat_specs:
                if s in category_sets:
                    category_sets[s].add(game_id)
                else:
                    category_sets[s] = set([game_id])

    We use the attributes dictionary to build the probability matrix.

    import numpy as np
    
    game_categories = list(category_sets.keys())
    probability_matrix = []
    for g in game_categories:
        prob_list = []
        for c in game_categories:
            # Share of the games with attribute g that also have attribute c, i.e. P(c | g)
            game_intersection = category_sets[g].intersection(category_sets[c])
            prob_list.append(len(game_intersection) / len(category_sets[g]))
        probability_matrix.append(prob_list)
    probability_matrix = np.array(probability_matrix)
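
    A tiny worked example may help show what each entry of the matrix means. The three attribute sets below are made up, and the helper name conditional_probability is an assumption for this sketch.

```python
# Toy mapping from attribute to the set of game ids that have it (made-up data)
category_sets = {
    "FPS": {1, 2, 3, 4},
    "Shooter": {1, 2},
    "Puzzle": {5},
}

def conditional_probability(given, other):
    """Estimate P(other | given): the share of games with `given` that also have `other`."""
    shared = category_sets[given] & category_sets[other]
    return len(shared) / len(category_sets[given])

p_shooter_given_fps = conditional_probability("FPS", "Shooter")   # 2 of the 4 FPS games
p_fps_given_shooter = conditional_probability("Shooter", "FPS")   # 2 of the 2 Shooter games
p_puzzle_given_fps = conditional_probability("FPS", "Puzzle")     # no games in common
```

    Two things follow from this definition: the matrix is generally not symmetric, since P(Shooter | FPS) differs from P(FPS | Shooter), and every diagonal entry is 1.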

    Applying Dimensionality Reduction

    The probability matrix computed in the last section has 381 dimensions. As discussed in my high dimensional clustering post, clustering data with very high dimensions can be problematic. To avoid these problems, we’re going to apply dimensionality reduction.

    The code snippet below uses the PCA class provided by scikit-learn to perform Principal Component Analysis on the probability matrix. To determine the number of principal components to keep, we computed a cumulative sum of the explained variance ratios of the principal components. Check out my article on PCA for more details on how it works.

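    from sklearn.decomposition import PCA
    
    pca = PCA()
    pca.fit(probability_matrix)
    
    total = 0
    # Start counting at 1 so each printed label matches the number of components summed so far
    for idx, r in enumerate(pca.explained_variance_ratio_, start=1):
      total += r
      print("{0} Components: {1}".format(idx, total))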

    We decided to keep just enough components to explain 70% of the variability in the data. That number happened to be 31. We will refit PCA to the dataset and reduce the dimensions.

    # Refit PCA to the probability matrix and keep only the 31 principal components
    pca = PCA(n_components=31)
    pca.fit(probability_matrix)
    feature_set = pca.transform(probability_matrix)
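
    The 70% cutoff can also be located programmatically instead of by reading the printed list. The sketch below uses made-up variance ratios and a hypothetical helper named components_for_variance; with a fitted model you would pass pca.explained_variance_ratio_ instead.

```python
from itertools import accumulate

def components_for_variance(explained_variance_ratio, threshold):
    """Return the smallest number of leading components whose ratios sum to >= threshold."""
    running_totals = accumulate(explained_variance_ratio)
    for count, total in enumerate(running_totals, start=1):
        if total >= threshold:
            return count
    # The threshold was never reached, so every component is needed
    return len(explained_variance_ratio)

# Made-up ratios for illustration only
ratios = [0.40, 0.20, 0.15, 0.10, 0.10, 0.05]
n_keep = components_for_variance(ratios, 0.70)
print(n_keep)
```

    As far as I know, scikit-learn's PCA can also take a fraction between 0 and 1 for n_components and pick the component count for you, which may be worth checking in its documentation.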

    With the newly featurized dataset in place, we can now proceed with the data clustering.

    Discovering Game Attribute Categories

    We use the silhouette method to find the optimal number of clusters for K-means. The code snippet below uses the scikit-learn implementations of K-means and the silhouette score to derive scores for different numbers of clusters. Check out this post for details on how the silhouette method works.

    from sklearn.cluster import KMeans
    from sklearn.metrics import silhouette_score
    
    range_n_clusters = range(2, 28)
    
    for n_clusters in range_n_clusters:
        clusterer = KMeans(n_clusters=n_clusters, random_state=25)
        cluster_labels = clusterer.fit_predict(feature_set)
    
        silhouette_avg = silhouette_score(feature_set, cluster_labels)
        print(
            "For n_clusters =",
            n_clusters,
            "The average silhouette_score is :",
            silhouette_avg,
        )

    Using the snippet above we determined the optimal number of clusters to be 24. Last, but not least, we group the game attributes into 24 clusters and output the group assignments.

    # Fit the K-means clustering algorithm and get the cluster assignment for each attribute
    km = KMeans(n_clusters=24, random_state=25)
    km.fit(feature_set)
    labels = km.predict(feature_set)
    attribute_assignments = pd.Series(labels, index=game_categories)
    for i in range(24):
        cluster = attribute_assignments[attribute_assignments == i]
        print("Cluster {0}: {1}".format(i, ",".join(cluster.index)))
    Cluster 0: Moddable,Trading,City Builder,Building,Economy,Base Building,Sandbox,Management,Space,Political,Agriculture,Space Sim,Capitalism,Politics,Resource Management,God Game,Fishing,Mining
    Cluster 1: Action,Indie,Simulation,Strategy,Single-player,RPG,Multi-player,Online Multi-Player,Cross-Platform Multiplayer,Steam Achievements,Steam Trading Cards,Stats,Adventure,Full controller support,Downloadable Content,Steam Cloud,Steam Leaderboards,Partial Controller Support,Early Access,Shared/Split Screen,Valve Anti-Cheat enabled,Steam Turn Notifications,Co-op,Violent,Commentary available,Steam Workshop,Includes level editor,Western,Flight,Tower Defense,Game demo,On-Rails Shooter,Soundtrack,Pinball
    Cluster 2: 2D,Replay Value,Difficult,Pixel Graphics,Cute,Singleplayer,Great Soundtrack,Retro,Platformer,Side Scroller,Stylized,Arcade,Underground,Remake,Action-Adventure,Spectacle fighter,Character Action Game,Beat 'em up,Controller,Fast-Paced,2.5D,Ninja,Puzzle-Platformer,Time Attack,Colorful,3D Platformer,Psychedelic,Score Attack,1980s,Time Manipulation,Cartoon,Metroidvania,Blood,Runner,Cartoony,GameMaker
    Cluster 3: FPS,Shooter,Third-Person Shooter,Sniper,Third Person,Survival,Classic,Gore,Sci-fi,Aliens,First-Person,Stealth,Assassin,Hunting,Futuristic,Cyberpunk,Destruction,Mechs,Robots,Lara Croft,Dinosaurs,Parkour,3D Vision,Zombies,Survival Horror,Bullet Time,Arena Shooter,Post-apocalyptic,Inventory Management,Star Wars,6DOF,Heist,Transhumanism,Gun Customization,Mars
    Cluster 4: Mod,Mods,Mods (require HL2),Mods (require HL1)
    Cluster 5: Design & Illustration,Tutorial,Education,Animation & Modeling,Animation &amp; Modeling,Video Production,Utilities,Web Publishing,Game Development,Software Training,Design &amp; Illustration,Audio Production,Photo Editing,Accounting
    Cluster 6: HTC Vive,Oculus Rift,Tracked Motion Controllers,Room-Scale,VR,Seated,Standing,SteamVR Collectibles,Keyboard / Mouse,Gamepad,Windows Mixed Reality,360 Video
    Cluster 7: Character Customization,Open World,Crafting,Swordplay,Hack and Slash,Action RPG,Medieval,Pirates,Dragons,Voxel,Sailing
    Cluster 8: Card Game,Trading Card Game,Turn-Based,Board Game,Turn-Based Strategy,4X,Turn-Based Tactics,Warhammer 40K,Games Workshop,Hex Grid,Tactical RPG,Turn-Based Combat,Strategy RPG,Asynchronous Multiplayer,Chess
    Cluster 9: Female Protagonist,Nudity,Anime,Choices Matter,Multiple Endings,Romance,Visual Novel,Sexual Content,Interactive Fiction,Dating Sim,RPGMaker,Choose Your Own Adventure,Text-Based,Otome
    Cluster 10: Tactical,War,Rome,Historical,Wargame,Cold War,Real-Time with Pause,RTS,Diplomacy,World War II,Alternate History,Real-Time,Grand Strategy,Real Time Tactics,Military,Naval,Tanks,America,World War I,Modern
    Cluster 11: 1990's,Story Rich,Atmospheric,Silent Protagonist,Linear,Mystery,Experience,Psychological Horror,Horror,Exploration,Point & Click,Underwater,Lovecraftian,Demons,Detective,Supernatural,Steampunk,Dystopian,Dark,Mature,Noir,Cinematic,FMV,Cult Classic,Based On A Novel,Surreal,Short,Walking Simulator,Psychological,Time Travel,Hand-drawn,Experimental,Quick-Time Events,Conspiracy,Narration,Dynamic Narration,Lore-Rich,Conversation,Nonlinear,Philisophical,Mystery Dungeon
    Cluster 12: Realistic,Driving,Trains,TrackIR
    Cluster 13: Captions available,Episodic,Crime,Benchmark,Movie,Thriller,Werewolves,Documentary,Martial Arts,Drama,Gaming,Foreign,Feature Film,Hardware,Faith
    Cluster 14: Fantasy,Dark Fantasy,Gothic,Isometric,Vampire,Magic,Mythology,Villain Protagonist,CRPG,Dungeon Crawler,JRPG,Kickstarter,Investigation,Crowdfunded,Grid-Based Movement,Voice Control
    Cluster 15: Casual,Physics,Science,Clicker,Puzzle,Music,Hidden Object,Match 3,Touch-Friendly,Family Friendly,Level Editor,Abstract,Relaxing,Mouse only,Music-Based Procedural Generation,Rhythm,Minimalist,Hacking,Lemmings,Sokoban,Typing,Programming,Artificial Intelligence,Word Game,Spelling,Steam Machine
    Cluster 16: LEGO,Batman,Superhero,Comic Book
    Cluster 17: Party-Based RPG,Software
    Cluster 18: Sports,Racing,Golf,Horses,Offroad,Bowling,Mini Golf,Football,Soccer,Gambling,Basketball,Cycling,Pool,Wrestling
    Cluster 19: Free to Play,PvP,Competitive,In-App Purchases,Multiplayer,Massively Multiplayer,MMO,Online Co-op,MMORPG,Online Co-Op,Team-Based,Includes Source SDK,Class-Based,PvE,MOBA,e-sports
    Cluster 20: Local Co-op,Local Multi-Player,Co-op Campaign,Local Co-Op,Local Multiplayer,Fighting,Split Screen,2D Fighter,4 Player Local
    Cluster 21: Top-Down,Top-Down Shooter,Loot,Shoot 'Em Up,Twin Stick Shooter,Bullet Hell,Rogue-like,Procedural Generation,Rogue-lite,Perma Death
    Cluster 22: Funny,Comedy,Satire,Dark Humor,Memes,Parody,Illuminati,Intentionally Awkward Controls,NSFW,Dark Comedy
    Cluster 23: Bikes

    K-means does a pretty good job of grouping attributes into meaningful clusters. One downside you might notice is that the cluster assignments are not guaranteed to be consistent from run to run. Because the initial centroids used by K-means are chosen at random, we can get different cluster outputs on the same dataset unless the random seed is fixed (as we did with random_state=25). We can work around this issue by performing the clustering several times and choosing the result that has the lowest sum of squared errors.
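
    Here is a minimal sketch of that workaround, assuming scikit-learn and NumPy are installed. The synthetic points stand in for the PCA-reduced feature_set, and the cluster count of 3 is chosen only to fit the toy data. The inertia_ attribute of a fitted KMeans model holds the sum of squared distances from each sample to its closest centroid.

```python
import numpy as np
from sklearn.cluster import KMeans

# Small synthetic dataset standing in for the PCA-reduced feature_set
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
points = np.vstack([c + rng.normal(scale=0.5, size=(30, 2)) for c in centers])

best_model = None
for seed in range(10):
    # n_init=1 makes each fit use a single random initialization
    model = KMeans(n_clusters=3, n_init=1, random_state=seed).fit(points)
    # Keep the run with the lowest sum of squared errors
    if best_model is None or model.inertia_ < best_model.inertia_:
        best_model = model

# Similar shortcut: let scikit-learn run the restarts and keep the best one itself
auto_model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(points)
print(best_model.inertia_, auto_model.inertia_)
```

    The n_init parameter does the restarts internally, so the manual loop is mainly useful when you want to inspect or log every run.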

    That’s all folks!

    In my next post, we will use the attribute clusters to make a new version of the recommender. We will then explore methods for evaluating how well our recommenders recommend video games to users. You can find the code for the entire solution here.

  • Building a Video Game Recommender System

    Building a Video Game Recommender System

    As a code coach at theCoderSchool, I teach and guide young students in the development of software applications. Some apps are simple calculator apps, mad libs generators, and implementations of popular games like Tic-Tac-Toe and Connect Four. Other apps are as complex as web scrapers, networked multi-player space shooters, and Rubik’s cube solvers. Recently I’ve been working with one of my advanced students on a video game recommender system. I immediately had the thought to document our process in a series of blog posts.

    A recommender system helps users discover new products and services that users would otherwise not discover on their own. Companies like Amazon and Netflix use recommender systems to suggest new products and movies for their users to buy and watch. Recommender systems work by examining how similar items are to the ones used by users in the past.

    In a series of blog posts I will guide you through the development of different recommender systems using the Steam Video Game Dataset. At the end of this post, you will learn how to create a very basic content-based recommender system that will recommend video games to existing users. 

    This post assumes that you have a solid understanding of Python and the pandas open source Python module. If your knowledge of Python and pandas is shaky, make sure to brush up on them before proceeding.

    Checking out the Steam Video Game Dataset

    The Steam Video Game Dataset provides several JSON files that contain information about reviews on the Steam platform, user and item metadata, and item bundles. For the recommender system we will be building in this post, we will be using the User and Item Data and Item metadata JSON files.

    The User and Item Data file contains information collected from over 5 million Steam users. Shown below is the JSON object structure for a single user.

    {'items': [{'item_id': <int>,
                'item_name': <string>,
                'playtime_2weeks': <int>,
                'playtime_forever': <int>},
               …],
     'items_count': <int>,
     'steam_id': <string>,
     'user_id': <string>,
     'user_url': <string>}

    The items element contains a list of video games played by a single user and the amount of time the user spent playing each game.

    The items_count element provides the total number of games played by the user. The steam_id, user_id, and user_url are unique identifiers for the user within the Steam platform.

    The Item metadata file provides data on 32,000 games on Steam. Below is the JSON object for one item in the dataset.

    {'app_name': 'Lost Summoner Kitty',
      'developer': 'Kotoshiro', 
      'discount_price': 4.49, 
      'early_access': False, 
      'genres': ['Action', 'Casual', 'Indie', 'Simulation', 'Strategy'],
      'id': '761140', 
      'price': 4.99, 
      'publisher': 'Kotoshiro',
      'release_date': '2018-01-04', 
      'reviews_url': ' http://steamcommunity.com/app/761140/reviews/?browsefilter=mostrecent&p=1', 
      'specs': ['Single-player'], 
      'tags': ['Strategy', 'Action', 'Indie', 'Casual', 'Simulation'], 
      'title': 'Lost Summoner Kitty',
       'url': ' http://store.steampowered.com/app/761140/Lost_Summoner_Kitty/'}

    Now that you’re acquainted with the data that we will be using, I’ll provide an overview of how the recommender will be built.

    How the Recommender Will Work

    We will develop a simple algorithm that will find video games that are similar in characteristics to games that the user has played in the past. The JSON data files covered in the last section will be used to build two tables.

    The first table, which we will call the game_features table, contains the attributes of each game in the dataset.

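    game id  | genre 1 | genre n | tag 1 | tag n | spec 1 | spec n
    xxxxxxxx | 1       | 0       | 1     | 0     | 0      | 0
    xxxxxxxx | 0       | 0       | 0     | 1     | 1      | 1
    xxxxxxxx | 1       | 0       | 0     | 0     | 1      | 0
    xxxxxxxx | 0       | 1       | 1     | 1     | 0      | 1
    xxxxxxxx | 1       | 0       | 0     | 0     | 0      | 0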

    We will use the genres, tags, and specs fields from the item metadata file to create a set of binary features for each game. Games that have the particular attribute will have the value 1, otherwise the value will be 0.

    The second table will contain preferred video game characteristics for each user. We’ll call this table the user_features table.

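    user id  | genre 1 | genre n | tag 1 | tag n | spec 1 | spec n
    xxxxxxxx | 0       | 1       | 1     | 1     | 1      | 0
    xxxxxxxx | 1       | 0       | 0     | 0     | 1      | 0
    xxxxxxxx | 1       | 0       | 0     | 0     | 0      | 0
    xxxxxxxx | 0       | 0       | 0     | 1     | 0      | 1
    xxxxxxxx | 1       | 0       | 0     | 0     | 0      | 1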

    This table has the same structure as the game_features table. Each binary feature indicates whether or not the user has played a game that has that particular attribute.

    Given an existing user, the algorithm will recommend new games by performing the following actions:

    1. Filter out all the games from the game_features table that were already played by the user.
    2. Compute a dissimilarity score between each remaining game in the game_features table and the user’s preferred game characteristics. We will use the dissimilarity scoring method covered in my post on the K-modes algorithm.
    3. Return the top 10 games with the lowest dissimilarity score.
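
    The three steps above can be sketched on a handful of made-up games. This sketch assumes the K-modes matching dissimilarity is the number of positions at which two vectors disagree; the game ids, feature rows, and function names are illustrative assumptions only.

```python
# Toy binary feature rows (made-up data): 1 means the attribute is present
game_features = {
    "g1": [1, 0, 1, 0],
    "g2": [1, 1, 0, 0],
    "g3": [0, 0, 0, 1],
    "g4": [1, 0, 1, 1],
}
user_features = [1, 0, 1, 0]   # attributes seen in the user's play history
already_played = {"g1"}

def matching_dissimilarity(a, b):
    """Count the positions at which two equal-length vectors disagree."""
    return sum(x != y for x, y in zip(a, b))

def recommend(n=10):
    # Step 1: drop games the user has already played
    candidates = {g: f for g, f in game_features.items() if g not in already_played}
    # Step 2: score every remaining game against the user's preferences
    scores = {g: matching_dissimilarity(user_features, f) for g, f in candidates.items()}
    # Step 3: keep the n games with the lowest dissimilarity
    return sorted(scores, key=scores.get)[:n]

top = recommend(n=2)
print(top)
```

    Game g4 differs from the user's preferences in only one position, so it ranks first, followed by g2 with two mismatches.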

    Now that you know how the algorithm will work, let’s start coding!

    Loading the Steam Games Dataset

    First things first. We will read the item metadata file, parse the JSON objects, and create a pandas data frame containing the fields from each object. Because the item metadata file contains improperly formatted JSON, the pandas read_json() function cannot be used to create the data frame. We will need to iterate through each JSON object independently and parse it using the Python ast module.

    import ast
    import pandas as pd
    
    with open('steam_games.json','r',encoding='utf8') as f:
        data = f.read()
    data = data.strip().split("\n")
    steam_games = []
    # The steam games dataset is not a properly formatted JSON file. Because of this we need to iterate through each individual JSON object and use
    # the ast module to parse the object.
    for entry in data:
        game = ast.literal_eval(entry)
        # Convert the genres, tags, and specs fields from lists to comma-separated strings
        if 'genres' in game:
            game['genres'] = ','.join(game['genres'])
        else:
            game['genres'] = None
        if 'tags' in game:
            game['tags'] = ','.join(game['tags'])
        else:
            game['tags'] = None
        if 'specs' in game:
            game['specs'] = ','.join(game['specs'])
        else:
            game['specs'] = None
        steam_games.append(game)
    # Create a dataframe
    steam_games_df = pd.DataFrame(steam_games)
    del(steam_games)

    Here’s what some of the data from the resulting data frame looks like:

    [Image: item metadata dataframe]
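
    The line-by-line parsing approach can be tried on a small in-memory sample before running it on the full file. The two records below are made up in the style of the dataset, where each line is a Python dict literal and not strict JSON.

```python
import ast

# Two records in the style of the dataset: Python dict literals, one per line.
# The single quotes and the bare False make these lines invalid JSON.
raw = """{'id': '1', 'genres': ['Action', 'Indie'], 'early_access': False}
{'id': '2', 'tags': ['Puzzle']}"""

records = []
for line in raw.strip().split("\n"):
    # literal_eval only evaluates Python literals, so it will not run arbitrary code
    record = ast.literal_eval(line)
    for field in ("genres", "tags", "specs"):
        # Join list fields into comma-separated strings; missing fields become None
        record[field] = ",".join(record[field]) if field in record else None
    records.append(record)

print(records[0]["genres"], records[1]["tags"])
```

    Looping over the three field names is just a more compact way to write the same three if/else branches.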

    Building the Steam Game Features Table

    With the item metadata loaded, we can now build the game_features table. But before we do that, let’s examine the values of the genres, tags, and specs attributes. The code snippet below creates a pandas series for each attribute.

    genres = []
    tags = []
    specs = []
    for idx in range(steam_games_df.shape[0]):
      game_genre = steam_games_df.iloc[idx]['genres']
      game_tags = steam_games_df.iloc[idx]['tags']
      game_specs = steam_games_df.iloc[idx]['specs']
      if game_genre:
        genres.extend(steam_games_df.iloc[idx]['genres'].split(","))
      if game_tags:
        tags.extend(steam_games_df.iloc[idx]['tags'].split(","))
      if game_specs:
        specs.extend(steam_games_df.iloc[idx]['specs'].split(","))
    
    genres_srs = pd.Series(genres)
    tags_srs = pd.Series(tags)
    specs_srs = pd.Series(specs)

    Using the pandas series unique() method, we can obtain the unique values for each attribute.

    genres_srs.unique()
    >> array(['Action', 'Casual', 'Indie', 'Simulation', 'Strategy',
           'Free to Play', 'RPG', 'Sports', 'Adventure', 'Racing',
           'Early Access', 'Massively Multiplayer',
           'Animation &amp; Modeling', 'Video Production', 'Utilities',
           'Web Publishing', 'Education', 'Software Training',
           'Design &amp; Illustration', 'Audio Production', 'Photo Editing',
           'Accounting'], dtype=object)
    
    tags_srs.unique()
    >> array(['Strategy', 'Action', 'Indie', 'Casual', 'Simulation',
           'Free to Play', 'RPG', 'Card Game', 'Trading Card Game',
           'Turn-Based', 'Fantasy', 'Tactical', 'Dark Fantasy', 'Board Game',
           'PvP', '2D', 'Competitive', 'Replay Value',
           'Character Customization', 'Female Protagonist', 'Difficult',
           'Design & Illustration', 'Sports', 'Multiplayer', 'Adventure',
           'FPS', 'Shooter', 'Third-Person Shooter', 'Sniper', 'Third Person',
           'Racing', 'Early Access', 'Survival', 'Pixel Graphics', 'Cute',
           'Physics', 'Science', 'VR', 'Tutorial', 'Classic', 'Gore',
           "1990's", 'Singleplayer', 'Sci-fi', 'Aliens', 'First-Person',
           'Story Rich', 'Atmospheric', 'Silent Protagonist',
           'Great Soundtrack', 'Moddable', 'Linear', 'Retro', 'Funny',
           'Turn-Based Strategy', 'Platformer', 'Side Scroller',
           'Massively Multiplayer', 'Clicker', 'Gothic', 'Isometric',
           'Stealth', 'Mystery', 'Assassin', 'Comedy', 'Stylized', 'Co-op',
           'War', 'Rome', 'Historical', 'Open World', 'Realistic', 'Crafting',
           'Trading', 'MMORPG', 'Swordplay', 'Hunting', 'Violent',
           'Experience', 'City Builder', 'Building', 'Economy',
           'Base Building', 'Education', 'Golf', 'Wargame', 'Cold War',
           'Real-Time with Pause', 'RTS', 'Diplomacy', 'Psychological Horror',
           'Sandbox', 'Mod', 'Online Co-Op', 'Animation & Modeling', 'Puzzle',
           'Horror', 'Management', 'Futuristic', 'Cyberpunk', 'Destruction',
           'Music', 'Driving', 'Arcade', 'Mechs', 'Robots', 'Underground',
           'Exploration', 'Point & Click', '4X', 'Trains', 'Top-Down',
           'Underwater', 'Turn-Based Tactics', 'Lovecraftian', 'Lara Croft',
           'Remake', 'Action-Adventure', 'Dinosaurs', 'Parkour', '3D Vision',
           'Hack and Slash', 'Spectacle fighter', 'Character Action Game',
           "Beat 'em up", 'Demons', 'Controller', 'Detective', 'Episodic',
           'Zombies', 'Fast-Paced', '2.5D', 'World War II', 'Supernatural',
           'Alternate History', 'Vampire', 'Space', 'Warhammer 40K',
           'Games Workshop', 'Real-Time', 'Steampunk', 'Dystopian',
           'Political', 'Dark', 'Action RPG', 'Grand Strategy',
           'Real Time Tactics', 'Medieval', 'Hidden Object', 'Crime',
           'Survival Horror', 'Mature', 'Noir', 'Bullet Time', 'Cinematic',
           'Nudity', 'Co-op Campaign', 'FMV', 'Match 3', 'Anime',
           'Touch-Friendly', 'Military', 'Western', 'Family Friendly',
           'Ninja', 'Arena Shooter', 'Naval', 'Agriculture', 'Horses',
           'Flight', 'TrackIR', 'Tanks', 'Cult Classic', 'Puzzle-Platformer',
           'Post-apocalyptic', 'Inventory Management', 'Benchmark',
           'Space Sim', 'Choices Matter', 'Based On A Novel',
           'Multiple Endings', 'Magic', 'LEGO', 'Batman', 'Local Co-Op',
           'Superhero', 'Comic Book', 'Local Multiplayer', 'Offroad',
           'Satire', 'Surreal', 'Capitalism', 'Bowling', 'Dark Humor',
           'Level Editor', 'Mythology', 'Time Attack', 'Colorful', 'Short',
           'Tower Defense', 'Top-Down Shooter', 'Villain Protagonist',
           'Fighting', 'Team-Based', 'Split Screen', 'Party-Based RPG',
           'CRPG', 'Pirates', 'Walking Simulator', 'Psychological', 'Memes',
           '3D Platformer', 'Psychedelic', 'Score Attack', 'Abstract',
           'Hex Grid', 'Tactical RPG', 'Turn-Based Combat', 'America',
           '2D Fighter', 'Star Wars', '1980s', 'Mini Golf',
           'Time Manipulation', 'Time Travel', 'On-Rails Shooter',
           '4 Player Local', 'Relaxing', 'Hand-drawn', 'Dungeon Crawler',
           'Loot', 'Cartoon', 'Mouse only', 'Experimental', 'Dragons',
           'Romance', 'Metroidvania', 'Parody', 'Quick-Time Events',
           'World War I', "Shoot 'Em Up", 'Music-Based Procedural Generation',
           'Twin Stick Shooter', 'Rhythm', 'Bullet Hell', '6DOF', 'Modern',
           'Class-Based', 'PvE', 'Heist', 'Politics', 'Resource Management',
           'Conspiracy', 'Minimalist', 'JRPG', 'Visual Novel', 'Hacking',
           'Strategy RPG', 'Lemmings', 'Illuminati', 'Sexual Content',
           'Movie', 'Blood', 'MOBA', 'Rogue-like', 'Runner', 'Narration',
           'Asynchronous Multiplayer', 'Chess', 'God Game', 'Soundtrack',
           'Procedural Generation', 'Rogue-lite', 'Perma Death',
           'Kickstarter', 'Investigation', 'Thriller', 'Cartoony',
           'Crowdfunded', 'Transhumanism', 'Interactive Fiction',
           'Dating Sim', 'Werewolves', 'Documentary', 'RPGMaker',
           'Gun Customization', 'Video Production', 'Software', 'e-sports',
           'Martial Arts', 'Mars', 'GameMaker', 'Utilities', 'Web Publishing',
           'Game Development', 'Choose Your Own Adventure', 'Text-Based',
           'Football', 'Soccer', 'Intentionally Awkward Controls', 'Gambling',
           'Software Training', 'Sokoban', 'Drama', 'NSFW',
           'Dynamic Narration', 'Typing', 'Pinball', 'Voxel', 'Basketball',
           'Fishing', 'Programming', 'Audio Production', 'Sailing', 'Mining',
           'Dark Comedy', 'Grid-Based Movement', 'Otome', 'Voice Control',
           'Artificial Intelligence', 'Cycling', 'Gaming', 'Photo Editing',
           'Lore-Rich', 'Word Game', 'Pool', 'Conversation', 'Nonlinear',
           'Spelling', 'Foreign', 'Feature Film', 'Hardware', 'Steam Machine',
           'Philisophical', 'Mystery Dungeon', 'Wrestling', '360 Video',
           'Faith', 'Bikes'], dtype=object)
    
    specs_srs.unique()
    >> array(['Single-player', 'Multi-player', 'Online Multi-Player',
           'Cross-Platform Multiplayer', 'Steam Achievements',
           'Steam Trading Cards', 'In-App Purchases', 'Stats',
           'Full controller support', 'HTC Vive', 'Oculus Rift',
           'Tracked Motion Controllers', 'Room-Scale', 'Downloadable Content',
           'Steam Cloud', 'Steam Leaderboards', 'Partial Controller Support',
           'Seated', 'Standing', 'Local Co-op', 'Shared/Split Screen',
           'Valve Anti-Cheat enabled', 'Local Multi-Player',
           'Steam Turn Notifications', 'MMO', 'Co-op', 'Online Co-op',
           'Captions available', 'Commentary available', 'Steam Workshop',
           'Includes level editor', 'Mods', 'Mods (require HL2)', 'Game demo',
           'Includes Source SDK', 'SteamVR Collectibles', 'Keyboard / Mouse',
           'Gamepad', 'Windows Mixed Reality', 'Mods (require HL1)'],
          dtype=object)

    As you can see, the attributes have extremely high cardinality. To address this, we will group the infrequently occurring values of each attribute into an ‘other’ category. In future posts, we will explore other methods for dealing with high-cardinality categorical data. The code snippet below identifies the groupings for each attribute and then creates the column names that will be used when we build the game_features table.

    # Genres that occur less than 1% of the time will be grouped in an 'other' category
    genre_counts = genres_srs.value_counts(normalize=True)
    other_genres = genre_counts[genre_counts < 0.01].index.to_list()
    genre_main_categories = ['genre_'+x for x in genre_counts[genre_counts >= 0.01].index.to_list()]
    # Tags that occur less than 0.1% of the time will be grouped in an 'other' category
    tag_counts = tags_srs.value_counts(normalize=True)
    other_tags = tag_counts[tag_counts < 0.001].index.to_list()
    tag_main_categories = ['tag_'+x for x in tag_counts[tag_counts >= 0.001].index.to_list()]
    # Specs that occur less than 0.1% of the time will be grouped in an 'other' category
    specs_counts = specs_srs.value_counts(normalize=True)
    other_specs = specs_counts[specs_counts < 0.001].index.to_list()
    specs_main_categories = ['spec_'+x for x in specs_counts[specs_counts >= 0.001].index.to_list()]
    # Column names for the game_features table. This definition is an assumption:
    # the main categories plus one 'other' column per attribute.
    feature_columns = (genre_main_categories + tag_main_categories + specs_main_categories
                       + ['genre_other', 'tag_other', 'spec_other'])
    game_features = []
    for idx in range(steam_games_df.shape[0]):
        # Obtain list of genres, tags, and specs
        game_genre = steam_games_df.iloc[idx]['genres']
        game_tags = steam_games_df.iloc[idx]['tags']
        game_specs = steam_games_df.iloc[idx]['specs']
        data_row = {k:0 for k in feature_columns}
        data_row['id'] = steam_games_df.iloc[idx]['id']
        # Iterate through each entry in the lists and create the binary features
        if game_genre:
            for genre in game_genre.split(','):
                if genre in other_genres:
                    data_row['genre_other'] = 1
                else:
                    data_row['genre_'+genre] = 1
        if game_tags:
            for tag in game_tags.split(','):
                if tag in other_tags:
                    data_row['tag_other'] = 1
                else:
                    data_row['tag_'+tag] = 1
        if game_specs:
            for spec in game_specs.split(','):
                if spec in other_specs:
                    data_row['spec_other'] = 1
                else:
                    data_row['spec_'+spec] = 1
        game_features.append(data_row)
    game_features_df = pd.DataFrame(game_features)

    Here’s what the dataframe looks like:

    item features table
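
    To make the grouping rule concrete, here is a minimal sketch on toy data. The helper name split_rare_categories, the toy genre list, and the 20% threshold are made up for illustration; the snippet above applies the same idea with thresholds of 1% and 0.1%.

```python
import pandas as pd

def split_rare_categories(values, min_share):
    """Split category values into (main, other) lists by relative frequency."""
    # value_counts(normalize=True) returns each value's share of all entries
    shares = pd.Series(values).value_counts(normalize=True)
    main = shares[shares >= min_share].index.to_list()
    other = shares[shares < min_share].index.to_list()
    return main, other

# Hypothetical toy data: 'Action' appears 6 times, 'Indie' 3 times, 'Chess' once
toy_genres = ['Action'] * 6 + ['Indie'] * 3 + ['Chess']
main, other = split_rare_categories(toy_genres, min_share=0.2)
print(main)   # ['Action', 'Indie']
print(other)  # ['Chess']
```

    Any value that lands in the second list would be folded into the single ‘other’ column instead of getting a column of its own.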

    Building the User Features Table

    We will now use the game_features table built in the last section to create the user_features table. We will load the user items JSON file and retrieve the list of games played by each user. Each game will be cross-referenced with the game_features table to determine the user’s preferred game characteristics. We will also store each user’s play history so that we can use it later when recommending new games to that user. Because the user items JSON file contains data for over 5 million users, the code snippet below will take a considerable amount of time to execute. To speed things up, you can reduce the number of records that are parsed.

    import ast

    game_features_df = game_features_df.set_index('id')
    # Convert game features dataframe into a dictionary to speed up the construction of the users table
    game_feat_dict = game_features_df.to_dict()
    # Read user items data file and build features table
    with open('/content/drive/MyDrive/VideoGameRecFiles/australian_users_items.json','r',encoding='utf8') as f:
        data = f.read()
    data = data.strip().split("\n")
    user_features = []
    # We need to keep track of all the games each user played so we can avoid recommending games that they have already played.
    user_play_list = {}
    for user_data in data:
        # The dataset is not a properly formatted JSON file. Because of this, we need to iterate through
        # each individual record and use the ast module to parse it.
        record = ast.literal_eval(user_data)
        data_row = {k:0 for k in feature_columns}
        data_row['user_id'] = record['user_id']
        play_list = []
        for item in record['items']:
            item_id = item['item_id']
            play_list.append(item_id)
            # Mark every feature that is present in at least one game the user has played
            for col in feature_columns:
                if item_id in game_feat_dict[col]:
                    if game_feat_dict[col][item_id] == 1:
                        data_row[col] = 1
        user_play_list[record['user_id']] = play_list
        user_features.append(data_row)
    user_features_df = pd.DataFrame(user_features)
    user_features_df = user_features_df.set_index('user_id')

    Here’s what the resulting table will look like:

    user features table
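
    One possible way to reduce the number of parsed records is to stop after a fixed count. The sketch below is only an illustration: the helper parse_user_records, its max_records parameter, and the in-memory sample lines are hypothetical and stand in for the real file.

```python
import ast
from itertools import islice

def parse_user_records(lines, max_records=None):
    """Parse at most max_records Python-literal records from an iterable of lines."""
    records = []
    # Skip blank lines so they do not count toward the limit
    non_empty = (line.strip() for line in lines if line.strip())
    # islice stops early; with max_records=None it yields every line
    for line in islice(non_empty, max_records):
        records.append(ast.literal_eval(line))
    return records

# Hypothetical sample with the same record shape as the user items file
sample_lines = [
    "{'user_id': 'a', 'items': [{'item_id': '10'}]}\n",
    "{'user_id': 'b', 'items': []}\n",
    "{'user_id': 'c', 'items': [{'item_id': '20'}]}\n",
]
subset = parse_user_records(sample_lines, max_records=2)
print([r['user_id'] for r in subset])  # ['a', 'b']
```

    Because an open file handle is itself an iterable of lines, passing it to a helper like this would avoid reading the whole file into memory before truncating.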

    The Recommender Algorithm

    With the game_features and user_features tables in place, we can now code the recommender algorithm.

    def dissimilarity_score(user, item):
        '''
        Given a row from the user features table and a row from the item features table, compute the dissimilarity score between the user and the item.
        The higher the score, the more dissimilar the two records are.
        '''
        score = 0
        # Count the feature columns on which the user and the item disagree
        for col in feature_columns:
            if user[col] != item[col]:
                score += 1
        return score
    def recommend_games(user_id, n=10):
        '''
        Given a user id, recommend games to that user. By default 10 games are recommended
        '''
        # Get user features
        user = user_features_df.loc[user_id]
        # Get games played by the user
        play_list = user_play_list[user_id]
        # Filter out the games already played by the user from the games features table
        filtered_df = game_features_df[~game_features_df.index.isin(play_list)]
        # Compute dissimilarity scores for each game
        scores = filtered_df.apply(lambda game: dissimilarity_score(user, game), axis=1)
        # Return the n games with the lowest dissimilarity scores
        recommended_games = scores.sort_values()[:n].index
        return recommended_games

    The code snippet above defines two functions. The dissimilarity_score function is the scoring function that will determine how similar a game is to a user’s game characteristics. The recommend_games function uses the dissimilarity_score function to provide the ids of games that are similar to a user’s preferences.
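
    Since every feature is binary, the score is simply the number of columns on which the user and the game disagree (a Hamming distance). As a sketch of an alternative to the row-by-row apply call, the same counts can be computed for all games at once with pandas. The function name dissimilarity_scores and the toy tables below are hypothetical and not part of the original solution.

```python
import pandas as pd

def dissimilarity_scores(user_row, games_df, columns):
    """For every game, count the feature columns whose value differs from the user's."""
    # ne(..., axis=1) compares each game row with the user Series column by column
    mismatches = games_df[columns].ne(user_row[columns], axis=1)
    return mismatches.sum(axis=1)

# Hypothetical toy tables with three binary features
columns = ['genre_Action', 'genre_Indie', 'tag_other']
games = pd.DataFrame(
    {'genre_Action': [1, 0, 1], 'genre_Indie': [0, 1, 1], 'tag_other': [0, 0, 1]},
    index=['g1', 'g2', 'g3'],
)
user = pd.Series({'genre_Action': 1, 'genre_Indie': 0, 'tag_other': 0})

scores = dissimilarity_scores(user, games, columns)
print(scores.sort_values().index[0])  # g1, the game that matches the user on every feature
```

    On large tables, a single vectorized comparison like this is usually much faster than calling a Python function once per game.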

    Here’s what we get when recommending new games for the Steam user ‘evcentric’:

    recommended_games = recommend_games('evcentric')
    recommended_games
    >>Index(['451600', '302670', '346330', '204360', '238460', '512900', '351100',
           '306460', '344890', '257750'],
          dtype='object', name='id')

    We can use the following code snippet to get the names of the titles:

    for game in recommended_games:
      filtered_games = steam_games_df[steam_games_df.id == game]
      game_name = filtered_games.iloc[0]['app_name']
      print(game_name)

    We get the following output:

    CounterAttack
    Call to Arms
    BrainBread 2
    Castle Crashers®
    BattleBlock Theater®
    Streets of Rogue
    Niffelheim
    Unturned - Permanent Gold Upgrade
    ARM PLANETARY PROSPECTORS Asteroid Resource Mining
    Bloody Trapland

    That’s all, Folks!

    We have created a basic video game recommender. There are definitely more enhancements we can make to the recommender in order to get better results. That is what we will be covering in the next series of posts. You can find the code for the entire solution here.