jupyter notebook

2025-08-12 19:09:53 +02:00
parent ac39214e0d
commit 9c3dd33c0b
3 changed files with 226 additions and 261 deletions
--- a/notebook.ipynb
+++ b/notebook.ipynb
@@ -7,103 +7,42 @@
   "source": [
    "# Machine Learning project in SoSe 2025 at HTW Saar\n",
    "## Idea\n",
-    "The goal of this project is getting the genre(s) of a game trough its given metadata\n",
+    "The goal of this project is to predict the genre(s) of a game/bundle from its description(s).\n",
    "\n",
    "## Dataset\n",
-    "For our project we use a Steam dataSet from kaggle. You can find it under the following URL: [Kaggle.com](https://www.kaggle.com/datasets/artermiloff/steam-games-dataset/data)\n",
-    "\n",
-    "### Importing the dataSet\n",
-    "The dataSet is imported and added as a variable."
+    "For our project we use a Steam dataset provided on Moodle, since it contains all the information we plan to use.\n",
+    "The dataset has been cut down to 2000 data points so that the notebook runs on weaker devices."
   ]
  },
  {
   "cell_type": "code",
-   "execution_count": 6,
   "id": "3116b75f",
-   "metadata": {},
-   "outputs": [
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "    appid                             name release_date  required_age  price  \\\n",
-      "0     730                 Counter-Strike 2   2012-08-21             0   0.00   \n",
-      "1  578080              PUBG: BATTLEGROUNDS   2017-12-21             0   0.00   \n",
-      "2     570                           Dota 2   2013-07-09             0   0.00   \n",
-      "3  271590        Grand Theft Auto V Legacy   2015-04-13            17   0.00   \n",
-      "4  359550  Tom Clancy's Rainbow Six® Siege   2015-12-01            17   3.99   \n",
-      "\n",
-      "   dlc_count                               detailed_description  \\\n",
-      "0          1  For over two decades, Counter-Strike has offer...   \n",
-      "1          0  LAND, LOOT, SURVIVE! Play PUBG: BATTLEGROUNDS ...   \n",
-      "2          2  The most-played game on Steam. Every day, mill...   \n",
-      "3          0  When a young street hustler, a retired bank ro...   \n",
-      "4          9  Edition Comparison Ultimate Edition The Tom Cl...   \n",
-      "\n",
-      "                                      about_the_game  \\\n",
-      "0  For over two decades, Counter-Strike has offer...   \n",
-      "1  LAND, LOOT, SURVIVE! Play PUBG: BATTLEGROUNDS ...   \n",
-      "2  The most-played game on Steam. Every day, mill...   \n",
-      "3  When a young street hustler, a retired bank ro...   \n",
-      "4  “One of the best first-person shooters ever ma...   \n",
-      "\n",
-      "                                   short_description  \\\n",
-      "0  For over two decades, Counter-Strike has offer...   \n",
-      "1  Play PUBG: BATTLEGROUNDS for free. Land on str...   \n",
-      "2  Every day, millions of players worldwide enter...   \n",
-      "3  Grand Theft Auto V for PC offers players the o...   \n",
-      "4  Tom Clancy's Rainbow Six® Siege is an elite, t...   \n",
-      "\n",
-      "                                             reviews  ...  \\\n",
-      "0                                                NaN  ...   \n",
-      "1                                                NaN  ...   \n",
-      "2  “A modern multiplayer masterpiece.” 9.5/10 – D...  ...   \n",
-      "3                                                NaN  ...   \n",
-      "4                                                NaN  ...   \n",
-      "\n",
-      "  average_playtime_2weeks median_playtime_forever median_playtime_2weeks  \\\n",
-      "0                     879                    5174                    350   \n",
-      "1                       0                       0                      0   \n",
-      "2                    1536                     898                    892   \n",
-      "3                     771                    7101                     74   \n",
-      "4                     682                    2434                    306   \n",
-      "\n",
-      "  discount  peak_ccu                                               tags  \\\n",
-      "0        0   1212356  {'FPS': 90857, 'Shooter': 65397, 'Multiplayer'...   \n",
-      "1        0    616738  {'Survival': 14838, 'Shooter': 12727, 'Battle ...   \n",
-      "2        0    555977  {'Free to Play': 59933, 'MOBA': 20158, 'Multip...   \n",
-      "3        0    117698  {'Open World': 32644, 'Action': 23539, 'Multip...   \n",
-      "4       80     89916  {'FPS': 9831, 'PvP': 9162, 'e-sports': 9072, '...   \n",
-      "\n",
-      "   pct_pos_total  num_reviews_total pct_pos_recent  num_reviews_recent  \n",
-      "0             86            8632939             82               96473  \n",
-      "1             59            2513842             68               16720  \n",
-      "2             81            2452595             80               29366  \n",
-      "3             87            1803832             92               17517  \n",
-      "4             84            1168020             76               12608  \n",
-      "\n",
-      "[5 rows x 47 columns]\n"
-     ]
+   "metadata": {
+    "jupyter": {
+     "is_executing": true
    }
-   ],
+   },
   "source": [
    "import numpy as np\n",
    "import pandas as pd\n",
+    "from sklearn import set_config\n",
    "\n",
-    "# load data\n",
-    "# appid,name,release_date,required_age,price,dlc_count,detailed_description,about_the_game,short_description,reviews,header_image,website,support_url,support_email,windows,mac,linux,metacritic_score,metacritic_url,achievements,recommendations,notes,supported_languages,full_audio_languages,packages,developers,publishers,categories,genres,screenshots,movies,user_score,score_rank,positive,negative,estimated_owners,average_playtime_forever,average_playtime_2weeks,median_playtime_forever,median_playtime_2weeks,discount,peak_ccu,tags,pct_pos_total,num_reviews_total,pct_pos_recent,num_reviews_recent\n",
-    "dataset = pd.read_csv(\"./games_march2025_cleaned_10k.csv\",sep=\",\")\n",
+    "set_config(transform_output=\"pandas\")\n",
+    "\n",
+    "dataset = pd.read_csv(\"./games_march2025_cleaned_2k.csv\",sep=\",\")\n",
    "print(dataset.head())"
-   ]
+   ],
+   "outputs": [],
+   "execution_count": null
  },
  {
   "cell_type": "markdown",
   "id": "cba9750a",
   "metadata": {},
   "source": [
-    "## Preparation of the Training-Set\n",
+    "## Preparation of the Dataset\n",
    "### Removing Uniques\n",
-    "We remove the following features from the Training-Set as they can uniquely identify a datapoint:\n",
+    "The following features can uniquely identify a datapoint and would normally be removed from the dataset. We skip this step because they are dropped in the next step anyway:\n",
    "- AppId\n",
    "- Name of the Game\n",
    "- Realease Date\n",
@@ -121,213 +60,228 @@
   ]
  },
  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "06dedcdf",
   "metadata": {},
-   "outputs": [
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "   required_age  price  dlc_count  \\\n",
-      "0             0   0.00          1   \n",
-      "1             0   0.00          0   \n",
-      "2             0   0.00          2   \n",
-      "3            17   0.00          0   \n",
-      "4            17   3.99          9   \n",
-      "\n",
-      "                                detailed_description  \\\n",
-      "0  For over two decades, Counter-Strike has offer...   \n",
-      "1  LAND, LOOT, SURVIVE! Play PUBG: BATTLEGROUNDS ...   \n",
-      "2  The most-played game on Steam. Every day, mill...   \n",
-      "3  When a young street hustler, a retired bank ro...   \n",
-      "4  Edition Comparison Ultimate Edition The Tom Cl...   \n",
-      "\n",
-      "                                      about_the_game  \\\n",
-      "0  For over two decades, Counter-Strike has offer...   \n",
-      "1  LAND, LOOT, SURVIVE! Play PUBG: BATTLEGROUNDS ...   \n",
-      "2  The most-played game on Steam. Every day, mill...   \n",
-      "3  When a young street hustler, a retired bank ro...   \n",
-      "4  “One of the best first-person shooters ever ma...   \n",
-      "\n",
-      "                                   short_description  \\\n",
-      "0  For over two decades, Counter-Strike has offer...   \n",
-      "1  Play PUBG: BATTLEGROUNDS for free. Land on str...   \n",
-      "2  Every day, millions of players worldwide enter...   \n",
-      "3  Grand Theft Auto V for PC offers players the o...   \n",
-      "4  Tom Clancy's Rainbow Six® Siege is an elite, t...   \n",
-      "\n",
-      "                                             reviews  windows    mac  linux  \\\n",
-      "0                                                NaN     True  False   True   \n",
-      "1                                                NaN     True  False  False   \n",
-      "2  “A modern multiplayer masterpiece.” 9.5/10 – D...     True   True   True   \n",
-      "3                                                NaN     True  False  False   \n",
-      "4                                                NaN     True  False  False   \n",
-      "\n",
-      "   ...  average_playtime_2weeks  median_playtime_forever  \\\n",
-      "0  ...                      879                     5174   \n",
-      "1  ...                        0                        0   \n",
-      "2  ...                     1536                      898   \n",
-      "3  ...                      771                     7101   \n",
-      "4  ...                      682                     2434   \n",
-      "\n",
-      "   median_playtime_2weeks discount peak_ccu  \\\n",
-      "0                     350        0  1212356   \n",
-      "1                       0        0   616738   \n",
-      "2                     892        0   555977   \n",
-      "3                      74        0   117698   \n",
-      "4                     306       80    89916   \n",
-      "\n",
-      "                                                tags pct_pos_total  \\\n",
-      "0  {'FPS': 90857, 'Shooter': 65397, 'Multiplayer'...            86   \n",
-      "1  {'Survival': 14838, 'Shooter': 12727, 'Battle ...            59   \n",
-      "2  {'Free to Play': 59933, 'MOBA': 20158, 'Multip...            81   \n",
-      "3  {'Open World': 32644, 'Action': 23539, 'Multip...            87   \n",
-      "4  {'FPS': 9831, 'PvP': 9162, 'e-sports': 9072, '...            84   \n",
-      "\n",
-      "  num_reviews_total pct_pos_recent  num_reviews_recent  \n",
-      "0           8632939             82               96473  \n",
-      "1           2513842             68               16720  \n",
-      "2           2452595             80               29366  \n",
-      "3           1803832             92               17517  \n",
-      "4           1168020             76               12608  \n",
-      "\n",
-      "[5 rows x 34 columns]\n"
-     ]
-    }
-   ],
+   "cell_type": "code",
   "source": [
-    "# appid,name,release_date,required_age,price,dlc_count,detailed_description,about_the_game,short_description,reviews,header_image,website,support_url,support_email,windows,mac,linux,metacritic_score,metacritic_url,achievements,recommendations,notes,supported_languages,full_audio_languages,packages,developers,publishers,categories,genres,screenshots,movies,user_score,score_rank,positive,negative,estimated_owners,average_playtime_forever,average_playtime_2weeks,median_playtime_forever,median_playtime_2weeks,discount,peak_ccu,tags,pct_pos_total,num_reviews_total,pct_pos_recent,num_reviews_recent\n",
-    "dataset.drop(['appid', 'name', 'release_date', 'reviews', 'header_image', 'website', 'support_url', 'support_email',\n",
-    "              'metacritic_url', 'developers', 'publishers', 'screenshots', 'movies', 'estimated_owners'],\n",
-    "              axis=1, inplace=True)\n",
+    "#dataset.drop(['appid', 'name', 'release_date', 'reviews', 'header_image', 'website', 'support_url', 'support_email', 'metacritic_url', 'notes', 'developers', 'publishers', 'screenshots', 'movies', 'estimated_owners'], axis=1, inplace=True)\n",
+    "#print(dataset.head())"
+   ],
+   "id": "d159117377f3633c",
+   "outputs": [],
+   "execution_count": null
+  },
+  {
+   "metadata": {},
+   "cell_type": "markdown",
+   "source": [
+    "## Hold onto necessary information\n",
+    "Our model should turn a textual description of a game into its genre. For that we need all the textual information a game has, as well as the genres of the game.\n",
+    "We use a ``ColumnTransformer`` to drop all unnecessary columns, merge all descriptions of a game into one combined description, and keep the genres.\n",
+    "\n",
+    "Setting ``verbose_feature_names_out=False`` is important so that the output column names are not prefixed with the transformer names."
+   ],
+   "id": "e1b28ddd69f1e9a6"
+  },
+  {
+   "metadata": {
+    "jupyter": {
+     "is_executing": true
+    }
+   },
+   "cell_type": "code",
+   "source": [
+    "from sklearn.compose import ColumnTransformer\n",
+    "from sklearn.preprocessing import FunctionTransformer\n",
+    "\n",
+    "# desc, genres\n",
+    "column_transformer = ColumnTransformer([\n",
+    "        # merge all descriptions\n",
+    "        ('desc', FunctionTransformer(lambda X: X.fillna('').agg(' '.join, axis=1).to_frame(name=\"desc\")),\n",
+    "            ['detailed_description', 'about_the_game', 'short_description']),\n",
+    "        ('pass', 'passthrough', ['genres']),\n",
+    "    ],\n",
+    "    verbose_feature_names_out=False\n",
+    ")\n",
+    "dataset = column_transformer.fit_transform(dataset)\n",
    "print(dataset.head())"
-   ]
+   ],
+   "id": "986fbb31a7ae0d8b",
+   "outputs": [],
+   "execution_count": null
+  },
+  {
+   "metadata": {},
+   "cell_type": "markdown",
+   "source": [
+    "### Adding missing Information\n",
+    "Some games might not have any description. For these we insert an empty string.\n",
+    "**TODO: check if dropna and the numeric_only fillna are needed, as we don't have any numeric columns**"
+   ],
+   "id": "f9b89c0645811564"
+  },
+  {
+   "metadata": {},
+   "cell_type": "code",
+   "source": [
+    "# missing numeric values => mean\n",
+    "dataset.fillna(dataset.mean(numeric_only=True), inplace=True)\n",
+    "# missing strings => empty string?\n",
+    "dataset.fillna('', inplace=True)\n",
+    "# drop all rows with missing values\n",
+    "dataset.dropna(inplace=True)"
+   ],
+   "id": "44239f6b7fd23cde",
+   "outputs": [],
+   "execution_count": null
+  },
+  {
+   "metadata": {},
+   "cell_type": "markdown",
+   "source": [
+    "## Transform Genres\n",
+    "The genre information is currently a string containing a Python list of genres. While this is machine-readable, we need a one-hot encoding for our model to work.\n",
+    "\n",
+    "#### Parsing the String-Array\n",
+    "``ast.literal_eval`` evaluates a string that contains only a Python literal (here a list of strings) without executing arbitrary code, so we use it to parse the genres."
+   ],
+   "id": "ca5b59b9fa8160a0"
+  },
+  {
+   "metadata": {},
+   "cell_type": "code",
+   "source": [
+    "import ast\n",
+    "\n",
+    "dataset['genres'] = dataset['genres'].map(ast.literal_eval)\n",
+    "print(dataset['genres'])"
+   ],
+   "id": "ebc5a24e9bc87fdd",
+   "outputs": [],
+   "execution_count": null
+  },
+  {
+   "metadata": {},
+   "cell_type": "markdown",
+   "source": [
+    "#### One-Hot-Encoding a Python-Array\n",
+    "sklearn's ``OneHotEncoder()`` assumes that every datapoint has exactly one class per feature, such as one of ``['Politics', 'Sport', 'Culture']``.\n",
+    "Steam allows an app/bundle to have multiple genres, so each datapoint in our dataset holds a list of classes. sklearn's ``MultiLabelBinarizer()`` supports this and creates one binary column per genre."
+   ],
+   "id": "f90756f9ad9211f4"
+  },
+  {
+   "metadata": {},
+   "cell_type": "code",
+   "source": [
+    "from sklearn.preprocessing import MultiLabelBinarizer\n",
+    "\n",
+    "mlb_genres = MultiLabelBinarizer()\n",
+    "genres_encoded = mlb_genres.fit_transform(dataset.pop('genres'))\n",
+    "genres_df = pd.DataFrame(genres_encoded, columns=mlb_genres.classes_)\n",
+    "print(genres_df.head())"
+   ],
+   "id": "d2c3527a5fc876bf",
+   "outputs": [],
+   "execution_count": null
+  },
+  {
+   "metadata": {},
+   "cell_type": "markdown",
+   "source": "With this, our target matrix is complete.",
+   "id": "671c01f9f4ae66d9"
  },
  {
   "cell_type": "markdown",
   "id": "f5436c87",
   "metadata": {},
   "source": [
-    "### Structurize Text\n",
-    "**TODO: check if makes sense**\n",
-    "The dataset holds a lot of unstructured data, we use Term Frequency-Inverse Document Frequency to structurize most Text-Features.\n",
-    "It is important to use an new Instance for each feature so they don't overlap with each other. \n",
-    "\n",
-    "### Standardize Values\n",
-    "We standardize only the text features to remove the stop words. The dataset allready provides standardized numerical features."
+    "### Vectorizing Text\n",
+    "If we want our model to be able to use text as input, we have to vectorize the text. TF-IDF (Term Frequency-Inverse Document Frequency) is an easy way of transforming each word into a feature with a value between 0 and 1. **TODO: filter out stopwords**"
   ]
  },
  {
   "cell_type": "code",
-   "execution_count": 9,
   "id": "4e8b407c",
   "metadata": {},
-   "outputs": [
-    {
-     "ename": "ValueError",
-     "evalue": "all the input array dimensions except for the concatenation axis must match exactly, but along dimension 0, the array at index 0 has size 1 and the array at index 3 has size 9999",
-     "output_type": "error",
-     "traceback": [
-      "\u001b[31m---------------------------------------------------------------------------\u001b[39m",
-      "\u001b[31mValueError\u001b[39m                                Traceback (most recent call last)",
-      "\u001b[36mCell\u001b[39m\u001b[36m \u001b[39m\u001b[32mIn[9]\u001b[39m\u001b[32m, line 11\u001b[39m\n\u001b[32m      3\u001b[39m \u001b[38;5;66;03m# appid,name,release_date,required_age,price,dlc_count,detailed_description,about_the_game,short_description,reviews,header_image,website,support_url,support_email,windows,mac,linux,metacritic_score,metacritic_url,achievements,recommendations,notes,supported_languages,full_audio_languages,packages,developers,publishers,categories,genres,screenshots,movies,user_score,score_rank,positive,negative,estimated_owners,average_playtime_forever,average_playtime_2weeks,median_playtime_forever,median_playtime_2weeks,discount,peak_ccu,tags,pct_pos_total,num_reviews_total,pct_pos_recent,num_reviews_recent\u001b[39;00m\n\u001b[32m      4\u001b[39m column_transformer = make_column_transformer(\n\u001b[32m      5\u001b[39m     (TfidfVectorizer(stop_words=\u001b[33m'\u001b[39m\u001b[33menglish\u001b[39m\u001b[33m'\u001b[39m), [\u001b[33m'\u001b[39m\u001b[33mdetailed_description\u001b[39m\u001b[33m'\u001b[39m]),\n\u001b[32m      6\u001b[39m     (TfidfVectorizer(stop_words=\u001b[33m'\u001b[39m\u001b[33menglish\u001b[39m\u001b[33m'\u001b[39m), [\u001b[33m'\u001b[39m\u001b[33mabout_the_game\u001b[39m\u001b[33m'\u001b[39m]),\n\u001b[32m      7\u001b[39m     (TfidfVectorizer(stop_words=\u001b[33m'\u001b[39m\u001b[33menglish\u001b[39m\u001b[33m'\u001b[39m), [\u001b[33m'\u001b[39m\u001b[33mshort_description\u001b[39m\u001b[33m'\u001b[39m]),\n\u001b[32m      8\u001b[39m     (\u001b[33m'\u001b[39m\u001b[33mpassthrough\u001b[39m\u001b[33m'\u001b[39m, 
[\u001b[33m'\u001b[39m\u001b[33mrequired_age\u001b[39m\u001b[33m'\u001b[39m,\u001b[33m'\u001b[39m\u001b[33mprice\u001b[39m\u001b[33m'\u001b[39m,\u001b[33m'\u001b[39m\u001b[33mdlc_count\u001b[39m\u001b[33m'\u001b[39m,\u001b[33m'\u001b[39m\u001b[33mreviews\u001b[39m\u001b[33m'\u001b[39m,\u001b[33m'\u001b[39m\u001b[33mwindows\u001b[39m\u001b[33m'\u001b[39m,\u001b[33m'\u001b[39m\u001b[33mmac\u001b[39m\u001b[33m'\u001b[39m,\u001b[33m'\u001b[39m\u001b[33mlinux\u001b[39m\u001b[33m'\u001b[39m,\u001b[33m'\u001b[39m\u001b[33mmetacritic_score\u001b[39m\u001b[33m'\u001b[39m,\u001b[33m'\u001b[39m\u001b[33machievements\u001b[39m\u001b[33m'\u001b[39m,\u001b[33m'\u001b[39m\u001b[33mrecommendations\u001b[39m\u001b[33m'\u001b[39m,\u001b[33m'\u001b[39m\u001b[33mnotes\u001b[39m\u001b[33m'\u001b[39m,\u001b[33m'\u001b[39m\u001b[33msupported_languages\u001b[39m\u001b[33m'\u001b[39m,\u001b[33m'\u001b[39m\u001b[33mfull_audio_languages\u001b[39m\u001b[33m'\u001b[39m,\u001b[33m'\u001b[39m\u001b[33mcategories\u001b[39m\u001b[33m'\u001b[39m,\u001b[33m'\u001b[39m\u001b[33mgenres\u001b[39m\u001b[33m'\u001b[39m,\u001b[33m'\u001b[39m\u001b[33muser_score\u001b[39m\u001b[33m'\u001b[39m,\u001b[33m'\u001b[39m\u001b[33mscore_rank\u001b[39m\u001b[33m'\u001b[39m,\u001b[33m'\u001b[39m\u001b[33mpositive\u001b[39m\u001b[33m'\u001b[39m,\u001b[33m'\u001b[39m\u001b[33mnegative\u001b[39m\u001b[33m'\u001b[39m,\u001b[33m'\u001b[39m\u001b[33maverage_playtime_forever\u001b[39m\u001b[33m'\u001b[39m,\u001b[33m'\u001b[39m\u001b[33maverage_playtime_2weeks\u001b[39m\u001b[33m'\u001b[39m,\u001b[33m'\u001b[39m\u001b[33mmedian_playtime_forever\u001b[39m\u001b[33m'\u001b[39m,\u001b[33m'\u001b[39m\u001b[33mmedian_playtime_2weeks\u001b[39m\u001b[33m'\u001b[39m,\u001b[33m'\u001b[39m\u001b[33mdiscount\u001b[39m\u001b[33m'\u001b[39m,\u001b[33m'\u001b[39m\u001b[33mpeak_ccu\u001b[39m\u001b[33m'\u001b[39m,\u001b[33m'\u001b[39m\u001b[33mtags\u001b[39m\u001b[33m'\u001b[39m,\u001b[33m'\u001b[39m\u001b[33mpct_pos_total\u001b[39m\u001b[3
3m'\u001b[39m,\u001b[33m'\u001b[39m\u001b[33mnum_reviews_total\u001b[39m\u001b[33m'\u001b[39m,\u001b[33m'\u001b[39m\u001b[33mpct_pos_recent\u001b[39m\u001b[33m'\u001b[39m,\u001b[33m'\u001b[39m\u001b[33mnum_reviews_recent\u001b[39m\u001b[33m'\u001b[39m])\n\u001b[32m      9\u001b[39m )\n\u001b[32m---> \u001b[39m\u001b[32m11\u001b[39m dataset = \u001b[43mcolumn_transformer\u001b[49m\u001b[43m.\u001b[49m\u001b[43mfit_transform\u001b[49m\u001b[43m(\u001b[49m\u001b[43mdataset\u001b[49m\u001b[43m)\u001b[49m\n\u001b[32m     12\u001b[39m \u001b[38;5;28mprint\u001b[39m(dataset.head())\n",
-      "\u001b[36mFile \u001b[39m\u001b[32mc:\\Users\\FlorianSpeicher\\anaconda3\\Lib\\site-packages\\sklearn\\utils\\_set_output.py:319\u001b[39m, in \u001b[36m_wrap_method_output.<locals>.wrapped\u001b[39m\u001b[34m(self, X, *args, **kwargs)\u001b[39m\n\u001b[32m    317\u001b[39m \u001b[38;5;129m@wraps\u001b[39m(f)\n\u001b[32m    318\u001b[39m \u001b[38;5;28;01mdef\u001b[39;00m\u001b[38;5;250m \u001b[39m\u001b[34mwrapped\u001b[39m(\u001b[38;5;28mself\u001b[39m, X, *args, **kwargs):\n\u001b[32m--> \u001b[39m\u001b[32m319\u001b[39m     data_to_wrap = \u001b[43mf\u001b[49m\u001b[43m(\u001b[49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mX\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43m*\u001b[49m\u001b[43margs\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43m*\u001b[49m\u001b[43m*\u001b[49m\u001b[43mkwargs\u001b[49m\u001b[43m)\u001b[49m\n\u001b[32m    320\u001b[39m     \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;28misinstance\u001b[39m(data_to_wrap, \u001b[38;5;28mtuple\u001b[39m):\n\u001b[32m    321\u001b[39m         \u001b[38;5;66;03m# only wrap the first output for cross decomposition\u001b[39;00m\n\u001b[32m    322\u001b[39m         return_tuple = (\n\u001b[32m    323\u001b[39m             _wrap_data_with_container(method, data_to_wrap[\u001b[32m0\u001b[39m], X, \u001b[38;5;28mself\u001b[39m),\n\u001b[32m    324\u001b[39m             *data_to_wrap[\u001b[32m1\u001b[39m:],\n\u001b[32m    325\u001b[39m         )\n",
-      "\u001b[36mFile \u001b[39m\u001b[32mc:\\Users\\FlorianSpeicher\\anaconda3\\Lib\\site-packages\\sklearn\\base.py:1389\u001b[39m, in \u001b[36m_fit_context.<locals>.decorator.<locals>.wrapper\u001b[39m\u001b[34m(estimator, *args, **kwargs)\u001b[39m\n\u001b[32m   1382\u001b[39m     estimator._validate_params()\n\u001b[32m   1384\u001b[39m \u001b[38;5;28;01mwith\u001b[39;00m config_context(\n\u001b[32m   1385\u001b[39m     skip_parameter_validation=(\n\u001b[32m   1386\u001b[39m         prefer_skip_nested_validation \u001b[38;5;129;01mor\u001b[39;00m global_skip_validation\n\u001b[32m   1387\u001b[39m     )\n\u001b[32m   1388\u001b[39m ):\n\u001b[32m-> \u001b[39m\u001b[32m1389\u001b[39m     \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[43mfit_method\u001b[49m\u001b[43m(\u001b[49m\u001b[43mestimator\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43m*\u001b[49m\u001b[43margs\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43m*\u001b[49m\u001b[43m*\u001b[49m\u001b[43mkwargs\u001b[49m\u001b[43m)\u001b[49m\n",
-      "\u001b[36mFile \u001b[39m\u001b[32mc:\\Users\\FlorianSpeicher\\anaconda3\\Lib\\site-packages\\sklearn\\compose\\_column_transformer.py:1031\u001b[39m, in \u001b[36mColumnTransformer.fit_transform\u001b[39m\u001b[34m(self, X, y, **params)\u001b[39m\n\u001b[32m   1028\u001b[39m \u001b[38;5;28mself\u001b[39m._validate_output(Xs)\n\u001b[32m   1029\u001b[39m \u001b[38;5;28mself\u001b[39m._record_output_indices(Xs)\n\u001b[32m-> \u001b[39m\u001b[32m1031\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28;43mself\u001b[39;49m\u001b[43m.\u001b[49m\u001b[43m_hstack\u001b[49m\u001b[43m(\u001b[49m\u001b[38;5;28;43mlist\u001b[39;49m\u001b[43m(\u001b[49m\u001b[43mXs\u001b[49m\u001b[43m)\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mn_samples\u001b[49m\u001b[43m=\u001b[49m\u001b[43mn_samples\u001b[49m\u001b[43m)\u001b[49m\n",
-      "\u001b[36mFile \u001b[39m\u001b[32mc:\\Users\\FlorianSpeicher\\anaconda3\\Lib\\site-packages\\sklearn\\compose\\_column_transformer.py:1225\u001b[39m, in \u001b[36mColumnTransformer._hstack\u001b[39m\u001b[34m(self, Xs, n_samples)\u001b[39m\n\u001b[32m   1215\u001b[39m         \u001b[38;5;28;01mraise\u001b[39;00m \u001b[38;5;167;01mValueError\u001b[39;00m(\n\u001b[32m   1216\u001b[39m             \u001b[33m\"\u001b[39m\u001b[33mConcatenating DataFrames from the transformer\u001b[39m\u001b[33m'\u001b[39m\u001b[33ms output lead to\u001b[39m\u001b[33m\"\u001b[39m\n\u001b[32m   1217\u001b[39m             \u001b[33m\"\u001b[39m\u001b[33m an inconsistent number of samples. The output may have Pandas\u001b[39m\u001b[33m\"\u001b[39m\n\u001b[32m   (...)\u001b[39m\u001b[32m   1220\u001b[39m             \u001b[33m\"\u001b[39m\u001b[33m samples.\u001b[39m\u001b[33m\"\u001b[39m\n\u001b[32m   1221\u001b[39m         )\n\u001b[32m   1223\u001b[39m     \u001b[38;5;28;01mreturn\u001b[39;00m output\n\u001b[32m-> \u001b[39m\u001b[32m1225\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[43mnp\u001b[49m\u001b[43m.\u001b[49m\u001b[43mhstack\u001b[49m\u001b[43m(\u001b[49m\u001b[43mXs\u001b[49m\u001b[43m)\u001b[49m\n",
-      "\u001b[36mFile \u001b[39m\u001b[32mc:\\Users\\FlorianSpeicher\\anaconda3\\Lib\\site-packages\\numpy\\_core\\shape_base.py:364\u001b[39m, in \u001b[36mhstack\u001b[39m\u001b[34m(tup, dtype, casting)\u001b[39m\n\u001b[32m    362\u001b[39m     \u001b[38;5;28;01mreturn\u001b[39;00m _nx.concatenate(arrs, \u001b[32m0\u001b[39m, dtype=dtype, casting=casting)\n\u001b[32m    363\u001b[39m \u001b[38;5;28;01melse\u001b[39;00m:\n\u001b[32m--> \u001b[39m\u001b[32m364\u001b[39m     \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[43m_nx\u001b[49m\u001b[43m.\u001b[49m\u001b[43mconcatenate\u001b[49m\u001b[43m(\u001b[49m\u001b[43marrs\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[32;43m1\u001b[39;49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mdtype\u001b[49m\u001b[43m=\u001b[49m\u001b[43mdtype\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mcasting\u001b[49m\u001b[43m=\u001b[49m\u001b[43mcasting\u001b[49m\u001b[43m)\u001b[49m\n",
-      "\u001b[31mValueError\u001b[39m: all the input array dimensions except for the concatenation axis must match exactly, but along dimension 0, the array at index 0 has size 1 and the array at index 3 has size 9999"
-     ]
-    }
-   ],
   "source": [
-    "from sklearn.compose import make_column_transformer\n",
    "from sklearn.feature_extraction.text import TfidfVectorizer\n",
-    "# appid,name,release_date,required_age,price,dlc_count,detailed_description,about_the_game,short_description,reviews,header_image,website,support_url,support_email,windows,mac,linux,metacritic_score,metacritic_url,achievements,recommendations,notes,supported_languages,full_audio_languages,packages,developers,publishers,categories,genres,screenshots,movies,user_score,score_rank,positive,negative,estimated_owners,average_playtime_forever,average_playtime_2weeks,median_playtime_forever,median_playtime_2weeks,discount,peak_ccu,tags,pct_pos_total,num_reviews_total,pct_pos_recent,num_reviews_recent\n",
-    "column_transformer = make_column_transformer(\n",
-    "    (TfidfVectorizer(stop_words='english'), ['detailed_description']),\n",
-    "    (TfidfVectorizer(stop_words='english'), ['about_the_game']),\n",
-    "    (TfidfVectorizer(stop_words='english'), ['short_description']),\n",
-    "    ('passthrough', ['required_age','price','dlc_count','reviews','windows','mac','linux','metacritic_score','achievements','recommendations','notes','supported_languages','full_audio_languages','categories','genres','user_score','score_rank','positive','negative','average_playtime_forever','average_playtime_2weeks','median_playtime_forever','median_playtime_2weeks','discount','peak_ccu','tags','pct_pos_total','num_reviews_total','pct_pos_recent','num_reviews_recent'])\n",
-    ")\n",
    "\n",
-    "dataset = column_transformer.fit_transform(dataset)\n",
-    "print(dataset.head())"
-   ]
+    "vectorizer = TfidfVectorizer()\n",
+    "tfidf_matrix = vectorizer.fit_transform(dataset['desc'])  # sparse matrix, not a pandas DataFrame\n",
+    "tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), columns=vectorizer.get_feature_names_out())\n",
+    "print(tfidf_df.head())"
+   ],
+   "outputs": [],
+   "execution_count": null
  },
  {
   "cell_type": "markdown",
   "id": "ad84e777",
   "metadata": {},
-   "source": [
-    "\n",
-    "### Removing Bundles\n",
-    "**(TODO: decide whether yes or no), not as important as i thought**\n",
-    "As bundles don't have clear genre(s) defined (e.g. publisher bundles )"
-   ]
+   "source": "With this, our feature matrix is complete."
  },
  {
-   "cell_type": "markdown",
-   "id": "6a2a3d4f",
   "metadata": {},
-   "source": [
-    "### Handling missing values\n",
-    "Removing NaN values in the dataSet and setting missing numerical feature values to the mean feature count. Missing Text values are set to a default String `Unknown`."
-   ]
-  },
-  {
   "cell_type": "code",
-   "execution_count": 6,
-   "id": "dea7dc00",
-   "metadata": {},
   "outputs": [],
+   "execution_count": null,
   "source": [
-    "# Setting missing numeric values to the mean\n",
-    "dataset.fillna(dataset.mean(numeric_only=True), inplace=True)\n",
-    "# Setting missing text values to 'Unknown'\n",
-    "dataset.fillna('Unknown', inplace=True)\n",
-    "# Setting missing values in other columns to NaN\n",
-    "dataset.dropna(inplace=True)"
-   ]
+    "X = tfidf_df\n",
+    "y = genres_df"
+   ],
+   "id": "86d9da42f4df8e49"
+  },
+  {
+   "metadata": {},
+   "cell_type": "markdown",
+   "source": [
+    "## The Model\n",
+    "\n",
+    "#### Removing unpredictable Datapoints\n",
+    "Some datapoints don't have a genre assigned (all label values in their row of y are 0). The model we use can't handle such cases, so they have to be removed.\n",
+    "We build a boolean mask that selects all datapoints with at least one genre and apply it to both matrices."
+   ],
+   "id": "aeb782668f311cd8"
+  },
+  {
+   "metadata": {},
+   "cell_type": "code",
+   "outputs": [],
+   "execution_count": null,
+   "source": [
+    "mask = y.sum(axis=1) > 0  # True for datapoints with at least one genre\n",
+    "print((~mask).sum())  # count of unpredictable datapoints\n",
+    "\n",
+    "X_clean = X[mask]\n",
+    "y_clean = y[mask]"
+   ],
+   "id": "4919bf1b37d171a7"
  },
  {
   "cell_type": "markdown",
   "id": "091d7e13",
   "metadata": {},
   "source": [
-    "# Data Split\n",
-    "Splitting our dataSet to training and testing data. The relation is 80% training and 20% testing data."
+    "# Splitting up data\n",
+    "We split our data into training and testing data. With the default ``test_size``, 25% of the datapoints are held out for testing.\n",
+    "Using ``random_state=0`` makes the split reproducible."
   ]
  },
  {
   "cell_type": "code",
-   "execution_count": null,
   "id": "cfbf3787",
-   "metadata": {},
-   "outputs": [
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "Trainingsdaten: (7999, 33), Testdaten: (2000, 33)\n"
-     ]
+   "metadata": {
+    "jupyter": {
+     "is_executing": true
    }
-   ],
+   },
   "source": [
    "from sklearn.model_selection import train_test_split\n",
    "\n",
-    "# Setting the target feature 'genres' and dropping it from the dataset\n",
-    "X = dataset.drop('genres', axis=1)\n",
-    "y = dataset['genres']\n",
-    "\n",
-    "X_train, X_test, y_train, y_test = train_test_split(\n",
-    "    X, y, test_size=0.2, random_state=42\n",
-    ")\n",
-    "\n",
-    "print(f\"Training: {X_train.shape}, Testing: {X_test.shape}\")"
-   ]
+    "X_train, X_test, y_train, y_test = train_test_split(X_clean, y_clean, random_state=0)"
+   ],
+   "outputs": [],
+   "execution_count": null
  },
  {
   "cell_type": "markdown",
@@ -335,17 +289,25 @@
   "metadata": {},
   "source": [
    "# Model Selection\n",
-    "**TODO Deciding which model to use for this task**"
+    "**TODO Deciding which model to use for this task**\n",
+    "\n",
+    "As a game can have multiple genres, our model has to be capable of multi-label classification. sklearn's ``MultiOutputClassifier`` provides this by fitting one classifier per genre. As the base estimator for ``MultiOutputClassifier`` we use ``LogisticRegression``."
   ]
  },
  {
-   "cell_type": "markdown",
-   "id": "b7795aa1",
   "metadata": {},
+   "cell_type": "code",
+   "outputs": [],
+   "execution_count": null,
   "source": [
-    "### Training\n",
-    "**TODO Train the Selected Model with the training data**"
-   ]
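+    "from sklearn.linear_model import LogisticRegression\n",
+    "from sklearn.multioutput import MultiOutputClassifier\n",
+    "\n",
+    "# n_jobs=1 because parallel fitting seemed to hang on join (a multithreading issue in sklearn, or the PC is too weak)\n",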
+    "multi_target_clf = MultiOutputClassifier(LogisticRegression(max_iter=1337, random_state=0), n_jobs=1)\n",
+    "\n",
+    "multi_target_clf.fit(X_train, y_train)\n",
+    "\n",
+    "y_pred = multi_target_clf.predict(X_test)"
+   ],
+   "id": "8c1d72c4532bd509"
  },
  {
   "cell_type": "markdown",
@@ -356,6 +318,14 @@
    "**TODO Test the Model with the test data**"
   ]
  },
+  {
+   "metadata": {},
+   "cell_type": "code",
+   "outputs": [],
+   "execution_count": null,
+   "source": [
+    "from sklearn.metrics import classification_report\n",
+    "\n",
+    "print(classification_report(y_test, y_pred, zero_division=0.0))"
+   ],
+   "id": "e2ebea6945193e07"
+  },
  {
   "cell_type": "markdown",
   "id": "2aeb6fc2",
@@ -386,7 +356,7 @@
 ],
 "metadata": {
  "kernelspec": {
-   "display_name": "base",
+   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
@@ -400,7 +370,7 @@
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
-   "version": "3.13.5"
+   "version": "3.13.3"
  }
 },
 "nbformat": 4,