{ "cells": [ { "cell_type": "markdown", "id": "a3a7634f", "metadata": {}, "source": [ "# Machine Learning project in SoSe 2025 at HTW Saar\n", "## Idea\n", "The goal of this project is to predict the genre(s) of a game/bundle from its description(s).\n", "\n", "## Dataset\n", "For our project we use a Steam dataset provided on Moodle, since it contains all the information we plan to use.\n", "The dataset has been cut down to 2000 data points so that it runs on weaker devices." ] }, { "cell_type": "code", "id": "3116b75f", "metadata": { "jupyter": { "is_executing": true } }, "source": [ "import numpy as np\n", "import pandas as pd\n", "from sklearn import set_config\n", "\n", "# make sklearn transformers return pandas DataFrames\n", "set_config(transform_output=\"pandas\")\n", "\n", "dataset = pd.read_csv(\"./games_march2025_cleaned_2k.csv\", sep=\",\")\n", "print(dataset.head())" ], "outputs": [], "execution_count": null }, { "cell_type": "markdown", "id": "cba9750a", "metadata": {}, "source": [ "## Preparation of the Dataset\n", "### Removing Uniques\n", "The following features can (almost) uniquely identify a data point and would normally be removed from the training set. We do not drop them here, because the next step removes them anyway:\n", "- AppId\n", "- Name of the Game\n", "- Release Date\n", "- Reviews\n", "- Header Image\n", "- Website\n", "- Support URL\n", "- Support Email\n", "- MetaCritic URL\n", "- Developer\n", "- Publisher\n", "- Screenshots\n", "- Movies\n", "- Estimated Owners" ] }, { "metadata": {}, "cell_type": "code", "source": [ "#dataset.drop(['appid', 'name', 'release_date', 'reviews', 'header_image', 'website', 'support_url', 'support_email', 'metacritic_url', 'notes', 'developers', 'publishers', 'screenshots', 'movies', 'estimated_owners'], axis=1, inplace=True)\n", "#print(dataset.head())" ], "id": "d159117377f3633c", "outputs": [], "execution_count": null }, { "metadata": {}, "cell_type": "markdown", "source": [ "## Hold onto necessary information\n", "Our model should turn a textual description of a game into 
its genres. For that we need all the textual information available for a game, as well as its genres.\n", "We use a ``ColumnTransformer`` to drop all unnecessary columns, merge all descriptions of a game into one long description, and keep the genres.\n", "\n", "It is important to use ``verbose_feature_names_out=False``; otherwise the output columns are prefixed with the transformer name (e.g. ``pass__genres``)." ], "id": "e1b28ddd69f1e9a6" }, { "metadata": { "jupyter": { "is_executing": true } }, "cell_type": "code", "source": [ "from sklearn.compose import ColumnTransformer\n", "from sklearn.preprocessing import FunctionTransformer\n", "\n", "# output columns: desc, genres (all other columns are dropped)\n", "column_transformer = ColumnTransformer([\n", "    # merge all descriptions into a single 'desc' column\n", "    ('desc', FunctionTransformer(lambda X: X.fillna('').agg(' '.join, axis=1).to_frame(name=\"desc\")),\n", "     ['detailed_description', 'about_the_game', 'short_description']),\n", "    ('pass', 'passthrough', ['genres']),\n", "    ],\n", "    verbose_feature_names_out=False\n", ")\n", "dataset = column_transformer.fit_transform(dataset)\n", "print(dataset.head())" ], "id": "986fbb31a7ae0d8b", "outputs": [], "execution_count": null }, { "metadata": {}, "cell_type": "markdown", "source": [ "### Adding missing Information\n", "Some games might not have any description. These already receive an empty string when the descriptions are merged above.\n", "Only the two string columns ``desc`` and ``genres`` remain at this point, so no numeric imputation is needed. A missing genre entry is replaced by the string ``'[]'`` so that it can still be parsed as an empty list in the next step." ], "id": "f9b89c0645811564" }, { "metadata": {}, "cell_type": "code", "source": [ "# 'desc' cannot contain NaN after the merge above; only 'genres' could\n", "# missing genres => string form of an empty list, so it stays parseable\n", "dataset = dataset.fillna({'genres': '[]'})\n", "# safety net: any other missing value => empty string\n", "dataset = dataset.fillna('')" ], "id": "44239f6b7fd23cde", "outputs": [], "execution_count": null }, { "metadata": {}, "cell_type": "markdown", "source": [ "## Transform Genres\n", "The genre information is currently a string holding a Python list of genres. 
While this is machine-readable, our model needs a one-hot (binary) encoding to work.\n", "\n", "#### Parsing the String-Array\n", "``ast.literal_eval`` from Python's ``ast`` module safely evaluates a string that contains a Python literal, so we use it to parse each genre string into a real list." ], "id": "ca5b59b9fa8160a0" }, { "metadata": {}, "cell_type": "code", "source": [ "import ast\n", "\n", "# \"['Action', 'Indie']\" (str) => ['Action', 'Indie'] (list)\n", "dataset['genres'] = dataset['genres'].map(ast.literal_eval)\n", "print(dataset['genres'])" ], "id": "ebc5a24e9bc87fdd", "outputs": [], "execution_count": null }, { "metadata": {}, "cell_type": "markdown", "source": [ "#### One-Hot-Encoding a Python Array\n", "sklearn's ``OneHotEncoder()`` assumes that every data point has exactly one class per feature, chosen from a flat list such as ``['Politics', 'Sport', 'Culture']``.\n", "Steam allows an app/bundle to have multiple genres, so each data point in our dataset holds a whole list of classes. sklearn's ``MultiLabelBinarizer()`` supports exactly this case." ], "id": "f90756f9ad9211f4" }, { "metadata": {}, "cell_type": "code", "source": [ "from sklearn.preprocessing import MultiLabelBinarizer\n", "\n", "mlb_genres = MultiLabelBinarizer()\n", "# one 0/1 column per genre; pop() also removes 'genres' from the dataset\n", "genres_encoded = mlb_genres.fit_transform(dataset.pop('genres'))\n", "genres_df = pd.DataFrame(genres_encoded, columns=mlb_genres.classes_)\n", "print(genres_df.head())" ], "id": "d2c3527a5fc876bf", "outputs": [], "execution_count": null }, { "metadata": {}, "cell_type": "markdown", "source": "With this, our target matrix is complete.", "id": "671c01f9f4ae66d9" }, { "cell_type": "markdown", "id": "f5436c87", "metadata": {}, "source": [ "### Vectorizing Text\n", "If we want our model to be able to use text as an input, we have to vectorize the text. TF-IDF (Term Frequency-Inverse Document Frequency) is an easy way of transforming each word into a feature with a value between 0 and 1. 
**TODO: filter out stopwords**" ] }, { "cell_type": "code", "id": "4e8b407c", "metadata": {}, "source": [ "from sklearn.feature_extraction.text import TfidfVectorizer\n", "\n", "vectorizer = TfidfVectorizer()\n", "tfidf_matrix = vectorizer.fit_transform(dataset['desc']) # sparse matrix, not a pandas df\n", "tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), columns=vectorizer.get_feature_names_out())\n", "print(tfidf_df.head())" ], "outputs": [], "execution_count": null }, { "cell_type": "markdown", "id": "ad84e777", "metadata": {}, "source": "With this, our feature matrix is complete." }, { "metadata": {}, "cell_type": "code", "outputs": [], "execution_count": null, "source": [ "X = tfidf_df\n", "y = genres_df" ], "id": "86d9da42f4df8e49" }, { "metadata": {}, "cell_type": "markdown", "source": [ "## The Model\n", "\n", "#### Removing unpredictable Datapoints\n", "Some data points don't have a genre assigned (all values in their row of y are 0). Such data points carry no label to learn from, so we remove them.\n", "We build a boolean mask that selects all usable rows and apply that mask to both matrices." ], "id": "aeb782668f311cd8" }, { "metadata": {}, "cell_type": "code", "outputs": [], "execution_count": null, "source": [ "mask = y.sum(axis=1) > 0 # True for rows with at least one genre\n", "print((~mask).sum()) # count of unpredictable datapoints\n", "\n", "X_clean = X[mask]\n", "y_clean = y[mask]" ], "id": "4919bf1b37d171a7" }, { "cell_type": "markdown", "id": "091d7e13", "metadata": {}, "source": [ "# Splitting up data\n", "We have to split our data into training and test data.\n", "Using ``random_state=0`` makes the split reproducible." 
] }, { "cell_type": "code", "id": "cfbf3787", "metadata": { "jupyter": { "is_executing": true } }, "source": [ "from sklearn.model_selection import train_test_split\n", "\n", "X_train, X_test, y_train, y_test = train_test_split(X_clean, y_clean, random_state=0)" ], "outputs": [], "execution_count": null }, { "cell_type": "markdown", "id": "12b5283d", "metadata": {}, "source": [ "# Model Selection\n", "**TODO Decide which model to use for this task**\n", "\n", "As a game can have multiple genres, our model has to be capable of multi-label classification. sklearn's ``MultiOutputClassifier`` supports this by fitting one classifier per genre. As the base estimator for ``MultiOutputClassifier`` we use ``LogisticRegression``." ] }, { "metadata": {}, "cell_type": "code", "outputs": [], "execution_count": null, "source": [ "from sklearn.linear_model import LogisticRegression\n", "from sklearn.multioutput import MultiOutputClassifier\n", "\n", "# n_jobs=1 since there seems to be some multithreading join issue in sklearn (or my PC is too slow)\n", "multi_target_clf = MultiOutputClassifier(LogisticRegression(max_iter=1337, random_state=0), n_jobs=1)\n", "\n", "multi_target_clf.fit(X_train, y_train)\n", "\n", "y_pred = multi_target_clf.predict(X_test)" ], "id": "8c1d72c4532bd509" }, { "cell_type": "markdown", "id": "0faa9856", "metadata": {}, "source": [ "# Evaluation\n", "**TODO Test the model with the test data**" ] }, { "metadata": {}, "cell_type": "code", "outputs": [], "execution_count": null, "source": [ "from sklearn.metrics import classification_report\n", "\n", "print(classification_report(y_test, y_pred, zero_division=0.0))" ], "id": "e2ebea6945193e07" }, { "cell_type": "markdown", "id": "2aeb6fc2", "metadata": {}, "source": [ "# Optimization\n", "**TODO Optimize the model based on the test results**" ] }, { "cell_type": "markdown", "id": "79b20645", "metadata": {}, "source": [ "# Validation\n", "**TODO Predict actual values**" ] }, { "cell_type": "markdown", "id": "3b709fb7", "metadata": {}, "source": [ "# Conclusion and outlook\n", "**TODO Write a conclusion and an outlook: what can still be done and where the issues were.**" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": 
"python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.13.3" } }, "nbformat": 4, "nbformat_minor": 5 }