machine-learning/notebook.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "a3a7634f",
   "metadata": {},
   "source": [
    "# Machine Learning project in SoSe 2025 at HTW Saar\n",
    "## Idea\n",
    "The goal of this project is predicting the genre(s) of a game/bundle through its given description(s)\n",
    "\n",
    "## Dataset\n",
    "For our project we use a Steam Dataset provided on moodle, since it has all information we plan on using.\n",
    "The Dataset has been cut to only 2000 data points to be runnable on weaker devices."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3116b75f",
   "metadata": {
    "jupyter": {
     "is_executing": true
    }
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd\n",
    "from sklearn import set_config\n",
    "\n",
    "set_config(transform_output=\"pandas\")\n",
    "\n",
    "dataset = pd.read_csv(\"./games_march2025_cleaned_2k.csv\",sep=\",\")\n",
    "print(dataset.head(1))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cba9750a",
   "metadata": {},
   "source": [
    "## Preparation of the Dataset\n",
    "### Removing Uniques\n",
    "We would remove the following features from the Training-Set as they can/could uniquely identify a datapoint, but we don't as they will be removed in the next step anyway\n",
    "- AppId\n",
    "- Name of the Game\n",
    "- Realease Date\n",
    "- Reviews\n",
    "- Header Image\n",
    "- Website\n",
    "- Support URL\n",
    "- Support Email\n",
    "- MetaCritic URL\n",
    "- Developer\n",
    "- Publisher\n",
    "- Screenshots\n",
    "- Movies\n",
    "- Estimated Owners"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d159117377f3633c",
   "metadata": {},
   "outputs": [],
   "source": [
    "#dataset.drop(['appid', 'name', 'release_date', 'reviews', 'header_image', 'website', 'support_url', 'support_email', 'metacritic_url', 'notes', 'developers', 'publishers', 'screenshots', 'movies', 'estimated_owners'], axis=1, inplace=True)\n",
    "#print(dataset.head())"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e1b28ddd69f1e9a6",
   "metadata": {},
   "source": [
    "## Hold onto necessary information\n",
    "Our model should turn a textual description of a game into its genre. For that we need all the textual information a game has, as well as the genres of the game.\n",
    "We use a ColumnTransformer to drop all unnecessary lines, merge all descriptions of a game into one big description and hold onto the genres\n",
    "\n",
    "It is important to use ``verbose_feature_names_out=False`` so the feature names don't get changed"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "986fbb31a7ae0d8b",
   "metadata": {
    "jupyter": {
     "is_executing": true
    }
   },
   "outputs": [],
   "source": [
    "from sklearn.compose import ColumnTransformer\n",
    "from sklearn.preprocessing import FunctionTransformer\n",
    "\n",
    "# desc, genres\n",
    "column_transformer = ColumnTransformer([\n",
    "        # merge all descriptions\n",
    "        ('desc', FunctionTransformer(lambda X: X.fillna('').agg(' '.join, axis=1).to_frame(name=\"desc\")),\n",
    "            ['detailed_description', 'about_the_game', 'short_description']),\n",
    "        ('pass', 'passthrough', ['genres']),\n",
    "    ],\n",
    "    verbose_feature_names_out=False\n",
    ")\n",
    "dataset = column_transformer.fit_transform(dataset)\n",
    "print(dataset.head())"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f9b89c0645811564",
   "metadata": {},
   "source": [
    "### Adding missing Information\n",
    "Some Games might not have any descriptions. For these we Input an Empty String\n",
    "**TODO: check if dropna and fillna numeric_only is needed, as we dont have any numbers**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "44239f6b7fd23cde",
   "metadata": {},
   "outputs": [],
   "source": [
    "# missing numeric values => mean\n",
    "dataset.fillna(dataset.mean(numeric_only=True), inplace=True)\n",
    "# missing strings => empty string?\n",
    "dataset.fillna('', inplace=True)\n",
    "# drop all lines with missing values\n",
    "dataset.dropna(inplace=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ca5b59b9fa8160a0",
   "metadata": {},
   "source": [
    "## Transform Genres\n",
    "The genre information currently is a string holding a python array of genres. While this is machine-readable, we need One-Hot-Encoding for our model to work.\n",
    "\n",
    "#### Serializing the String-Array\n",
    "The \"ast\" library can interpret python strings as python code, and as such will be used for serializing the genres."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ebc5a24e9bc87fdd",
   "metadata": {},
   "outputs": [],
   "source": [
    "import ast\n",
    "\n",
    "dataset['genres'] = dataset['genres'].map(lambda s: ast.literal_eval(s))\n",
    "print(dataset['genres'].head())"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f90756f9ad9211f4",
   "metadata": {},
   "source": [
    "#### One-Hot-Encoding an Python-Array\n",
    "The sklearn ``OneHotEncoder()`` is only able to work with an 1D Array of different classes, such as ``['Politics', 'Sport', 'Culture']``. Every datapoint can only have one concurrent classification.\n",
    "Steam allows an app/bundle to have multiple genres. As such, our dataset has an 2D Array of different classes, which sklearn's ``MultiLabelBinarizer()`` does support."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d2c3527a5fc876bf",
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.preprocessing import MultiLabelBinarizer\n",
    "\n",
    "mlb_genres = MultiLabelBinarizer()\n",
    "genres_encoded = mlb_genres.fit_transform(dataset.pop('genres'))\n",
    "genres_df = pd.DataFrame(genres_encoded, columns=mlb_genres.classes_)\n",
    "print(genres_df.head())"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "671c01f9f4ae66d9",
   "metadata": {},
   "source": [
    "With this, our target matrix is completed."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f5436c87",
   "metadata": {},
   "source": [
    "### Structurizing Text\n",
    "If we want our Model to be able to use text as an input, we have to vectorize the text. TF-IDF (Inverse Document Frequency) is an easy way of transforming each word into a feature with a 0 to 1 value. **TODO: filter out stopwords**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4e8b407c",
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.feature_extraction.text import TfidfVectorizer\n",
    "\n",
    "vectorizer = TfidfVectorizer()\n",
    "tfidf_matrix = vectorizer.fit_transform(dataset['desc']) # matrix, not pandas df\n",
    "tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), columns=vectorizer.get_feature_names_out())\n",
    "print(tfidf_df.head())"
   ]
  },
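  {
   "cell_type": "markdown",
   "id": "c41e9a7d",
   "metadata": {},
   "source": [
    "A minimal sketch of one way to approach the stopword TODO above, assuming the descriptions are mostly English: ``TfidfVectorizer`` accepts ``stop_words='english'``, which removes scikit-learn's built-in English stopword list from the vocabulary. The toy sentences and the ``toy_*`` names are made up for illustration and are not part of the dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5d2f8b16",
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.feature_extraction.text import TfidfVectorizer\n",
    "\n",
    "# hypothetical example texts, not taken from the Steam dataset\n",
    "toy_descriptions = [\n",
    "    'A fast racing game with the best cars',\n",
    "    'An epic strategy game about the rise of an empire',\n",
    "]\n",
    "\n",
    "toy_plain = TfidfVectorizer().fit(toy_descriptions)\n",
    "toy_filtered = TfidfVectorizer(stop_words='english').fit(toy_descriptions)\n",
    "\n",
    "# words that are dropped from the vocabulary once stopwords are filtered\n",
    "toy_removed = set(toy_plain.get_feature_names_out()) - set(toy_filtered.get_feature_names_out())\n",
    "print(sorted(toy_removed))"
   ]
  },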
  {
   "cell_type": "markdown",
   "id": "ad84e777",
   "metadata": {},
   "source": [
    "With this our feature matrix is completed"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "86d9da42f4df8e49",
   "metadata": {},
   "outputs": [],
   "source": [
    "X = tfidf_df\n",
    "y = genres_df"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "aeb782668f311cd8",
   "metadata": {},
   "source": [
    "## The Model\n",
    "\n",
    "####  Removing unpredicatble Datapoints\n",
    "Some Datapoints don't have a genre assigned (all feature values in y are 0). The model we use can't handle such cases, thus they have to be removed.\n",
    "We filter after all values that we can use with a mask, and apply that mask to our matrices."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4919bf1b37d171a7",
   "metadata": {},
   "outputs": [],
   "source": [
    "mask = y.sum(axis=1).map(lambda x: x > 0)\n",
    "print((mask == False).sum()) # count of unpredictable datapoints\n",
    "\n",
    "X_clean = X[mask]\n",
    "y_clean = y[mask]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "091d7e13",
   "metadata": {},
   "source": [
    "# Splitting up data\n",
    "We have to split up our data into training and testing data.\n",
    "Using random_state=0 guarantees reproducability."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cfbf3787",
   "metadata": {
    "jupyter": {
     "is_executing": true
    }
   },
   "outputs": [],
   "source": [
    "from sklearn.model_selection import train_test_split\n",
    "\n",
    "X_train, X_test, y_train, y_test = train_test_split(X_clean, y_clean, random_state=0)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "84f56229",
   "metadata": {},
   "source": [
    "Now that all data is prepared, we need to choose a Classification Model that meets our stanadrds."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "917ba82f",
   "metadata": {},
   "source": [
    "# Excursion: Choosing a classification Model\n",
    "``sklearn`` has many different classification Models to choose from, but we only have limited time and computing power.\n",
    "As such, we tested many different models on the 2k Dataset and chose the 5 best performing ones for the big dataset.\n",
    "\n",
    "### Initial Comparison\n",
    "We won't put the comparison script in this notebook, but you can find it in the ``compare_models_2k.py`` file and try it out yourself.\n",
    "There were some rules as a baseline for comparison:\n",
    "- All Hyperparameters are set to default\n",
    "- All iteration limits are set to 3000 (exception: MLPClassifier with 300, where i-limit are epochs instead of iterations )\n",
    "- All ``random_state``s are set to 0\n",
    "\n",
    "Running all models with that configuration yields the following weighted F1-Scores (results as seen in the ``games_march2025_cleaned_2k_i3k`` folder): \n",
    "\n",
    "![Comparison Image 2k](./compare_models_2k.png)\n",
    "\n",
    "If we also compare Micro/Macro values, we see that all models have a much lower Macro-F1 than Micro/Weighted-F1. That is because the 2k Dataset does not contain enough datapoints for every class (test data for 2 classes is 0), so we should proceed to the 10k Dataset before making major choices.\n",
    "\n",
    "![Comparison Image 2k Micro/Macro/Weighted](./compare_models_2k_3.png)\n",
    "\n",
    "The 10 best performing models which will run on the 10k Dataset with the same rules as before:\n",
    "1. NearestCentroid\n",
    "2. Perceptron\n",
    "3. PassiveAggressiveClassifier\n",
    "4. LinearSVC\n",
    "5. SDGClassifer\n",
    "6. HistGradientBoostingClassifier\n",
    "7. MLPClassifier\n",
    "8. RidgeClassifier\n",
    "9. GradientBoostingClassifier\n",
    "10. LinearDiscriminationAnalysis\n",
    "\n",
    "![Comparison Image 10k](./compare_models_10k.png)\n",
    "\n",
    "We can also compare these models between datasets, to see if a bigger dataset always improves the performance.\n",
    "\n",
    "![Comparison Image between 2k and 10k](./compare_models_2k_10k.png)\n",
    "\n",
    "The final contenders are:\n",
    "1.\n",
    "2.\n",
    "3.\n",
    "4.\n",
    "5.\n",
    "\n",
    "..."
   ]
  },
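  {
   "cell_type": "markdown",
   "id": "7a90e3c4",
   "metadata": {},
   "source": [
    "To illustrate why Macro-F1 drops when a class has no test samples, here is a small sketch with made-up labels (the ``toy_*`` arrays are hypothetical and not results from our models). Even with perfect predictions for the genres that do occur, the genre without any test sample contributes an F1 of 0 to the unweighted macro average, while the micro and weighted averages are not affected by it."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e6b1f052",
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "from sklearn.metrics import f1_score\n",
    "\n",
    "# hypothetical multi-label targets: 6 games (rows), 3 genres (columns)\n",
    "# the third genre never occurs, like a class without test data\n",
    "toy_y_true = np.array([\n",
    "    [1, 0, 0],\n",
    "    [1, 0, 0],\n",
    "    [1, 1, 0],\n",
    "    [1, 0, 0],\n",
    "    [0, 1, 0],\n",
    "    [1, 0, 0],\n",
    "])\n",
    "toy_y_pred = toy_y_true.copy()  # perfect predictions for the genres that occur\n",
    "\n",
    "# compare the three averaging modes on the same predictions\n",
    "for average in ('micro', 'macro', 'weighted'):\n",
    "    score = f1_score(toy_y_true, toy_y_pred, average=average, zero_division=0.0)\n",
    "    print(f'{average:>8}: {score:.3f}')"
   ]
  },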
  {
   "cell_type": "markdown",
   "id": "12b5283d",
   "metadata": {},
   "source": [
    "## Model Selection\n",
    "**TODO Deciding which model to use for this task**\n",
    "\n",
    "As a game can have multiple genres, our Model(s) has to be capable of multi-label-classification. sklearn's ``MultiOutputClassifier`` can do this. As a backend for ``MultiOutputClassifier`` we use ``LogisticRegression``"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8c1d72c4532bd509",
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.linear_model import LogisticRegression\n",
    "from sklearn.multioutput import MultiOutputClassifier\n",
    "\n",
    "# n_jobs=1 since there seems to be some multithreading join issue in sklearn (or my pc is to bad)\n",
    "multi_target_clf = MultiOutputClassifier(LogisticRegression(max_iter=1337, random_state=0), n_jobs=1)\n",
    "\n",
    "multi_target_clf.fit(X_train, y_train)\n",
    "\n",
    "y_pred = multi_target_clf.predict(X_test)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0faa9856",
   "metadata": {},
   "source": [
    "# Evaluation\n",
    "**TODO Test the Model with the test data**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e2ebea6945193e07",
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.metrics import classification_report\n",
    "\n",
    "print(classification_report(y_test, y_pred, zero_division=0.0))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2aeb6fc2",
   "metadata": {},
   "source": [
    "# Optimization\n",
    "**TODO optimize the model based on the test results**"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "79b20645",
   "metadata": {},
   "source": [
    "# Validation\n",
    "**TODO Predict actual values**"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3b709fb7",
   "metadata": {},
   "source": [
    "# Conclusion and outlook\n",
    "**TODO Write a conclusion and outlook what can be done and where the issues were.**"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.13.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}