machine-learning/notebook.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "a3a7634f",
   "metadata": {},
   "source": [
    "# Machine Learning project in SoSe 2025 at HTW Saar\n",
    "## Idea\n",
    "The goal of this project is to predict the genre(s) of a game from its given metadata.\n",
    "\n",
    "## Dataset\n",
    "For our project we use a Steam dataset from Kaggle. You can find it at the following URL: [Kaggle.com](https://www.kaggle.com/datasets/artermiloff/steam-games-dataset/data)\n",
    "\n",
    "### Importing the dataset\n",
    "The dataset is loaded from the CSV file and stored in the variable `dataset`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "3116b75f",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "    appid                             name release_date  required_age  price  \\\n",
      "0     730                 Counter-Strike 2   2012-08-21             0   0.00   \n",
      "1  578080              PUBG: BATTLEGROUNDS   2017-12-21             0   0.00   \n",
      "2     570                           Dota 2   2013-07-09             0   0.00   \n",
      "3  271590        Grand Theft Auto V Legacy   2015-04-13            17   0.00   \n",
      "4  359550  Tom Clancy's Rainbow Six® Siege   2015-12-01            17   3.99   \n",
      "\n",
      "   dlc_count                               detailed_description  \\\n",
      "0          1  For over two decades, Counter-Strike has offer...   \n",
      "1          0  LAND, LOOT, SURVIVE! Play PUBG: BATTLEGROUNDS ...   \n",
      "2          2  The most-played game on Steam. Every day, mill...   \n",
      "3          0  When a young street hustler, a retired bank ro...   \n",
      "4          9  Edition Comparison Ultimate Edition The Tom Cl...   \n",
      "\n",
      "                                      about_the_game  \\\n",
      "0  For over two decades, Counter-Strike has offer...   \n",
      "1  LAND, LOOT, SURVIVE! Play PUBG: BATTLEGROUNDS ...   \n",
      "2  The most-played game on Steam. Every day, mill...   \n",
      "3  When a young street hustler, a retired bank ro...   \n",
      "4  “One of the best first-person shooters ever ma...   \n",
      "\n",
      "                                   short_description  \\\n",
      "0  For over two decades, Counter-Strike has offer...   \n",
      "1  Play PUBG: BATTLEGROUNDS for free. Land on str...   \n",
      "2  Every day, millions of players worldwide enter...   \n",
      "3  Grand Theft Auto V for PC offers players the o...   \n",
      "4  Tom Clancy's Rainbow Six® Siege is an elite, t...   \n",
      "\n",
      "                                             reviews  ...  \\\n",
      "0                                                NaN  ...   \n",
      "1                                                NaN  ...   \n",
      "2  “A modern multiplayer masterpiece.” 9.5/10 – D...  ...   \n",
      "3                                                NaN  ...   \n",
      "4                                                NaN  ...   \n",
      "\n",
      "  average_playtime_2weeks median_playtime_forever median_playtime_2weeks  \\\n",
      "0                     879                    5174                    350   \n",
      "1                       0                       0                      0   \n",
      "2                    1536                     898                    892   \n",
      "3                     771                    7101                     74   \n",
      "4                     682                    2434                    306   \n",
      "\n",
      "  discount  peak_ccu                                               tags  \\\n",
      "0        0   1212356  {'FPS': 90857, 'Shooter': 65397, 'Multiplayer'...   \n",
      "1        0    616738  {'Survival': 14838, 'Shooter': 12727, 'Battle ...   \n",
      "2        0    555977  {'Free to Play': 59933, 'MOBA': 20158, 'Multip...   \n",
      "3        0    117698  {'Open World': 32644, 'Action': 23539, 'Multip...   \n",
      "4       80     89916  {'FPS': 9831, 'PvP': 9162, 'e-sports': 9072, '...   \n",
      "\n",
      "   pct_pos_total  num_reviews_total pct_pos_recent  num_reviews_recent  \n",
      "0             86            8632939             82               96473  \n",
      "1             59            2513842             68               16720  \n",
      "2             81            2452595             80               29366  \n",
      "3             87            1803832             92               17517  \n",
      "4             84            1168020             76               12608  \n",
      "\n",
      "[5 rows x 47 columns]\n"
     ]
    }
   ],
   "source": [
    "import numpy as np\n",
    "import pandas as pd\n",
    "\n",
    "# load data\n",
    "# appid,name,release_date,required_age,price,dlc_count,detailed_description,about_the_game,short_description,reviews,header_image,website,support_url,support_email,windows,mac,linux,metacritic_score,metacritic_url,achievements,recommendations,notes,supported_languages,full_audio_languages,packages,developers,publishers,categories,genres,screenshots,movies,user_score,score_rank,positive,negative,estimated_owners,average_playtime_forever,average_playtime_2weeks,median_playtime_forever,median_playtime_2weeks,discount,peak_ccu,tags,pct_pos_total,num_reviews_total,pct_pos_recent,num_reviews_recent\n",
    "dataset = pd.read_csv(\"./games_march2025_cleaned_10k.csv\",sep=\",\")\n",
    "print(dataset.head())"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cba9750a",
   "metadata": {},
   "source": [
    "## Preparation of the Training Set\n",
    "### Removing Uniques\n",
    "We remove the following features from the training set because they can uniquely identify a data point:\n",
    "- AppId\n",
    "- Name of the Game\n",
    "- Release Date\n",
    "- Reviews\n",
    "- Header Image\n",
    "- Website\n",
    "- Support URL\n",
    "- Support Email\n",
    "- MetaCritic URL\n",
    "- Developer\n",
    "- Publisher\n",
    "- Screenshots\n",
    "- Movies\n",
    "- Estimated Owners"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "06dedcdf",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "   required_age  price  dlc_count  \\\n",
      "0             0   0.00          1   \n",
      "1             0   0.00          0   \n",
      "2             0   0.00          2   \n",
      "3            17   0.00          0   \n",
      "4            17   3.99          9   \n",
      "\n",
      "                                detailed_description  \\\n",
      "0  For over two decades, Counter-Strike has offer...   \n",
      "1  LAND, LOOT, SURVIVE! Play PUBG: BATTLEGROUNDS ...   \n",
      "2  The most-played game on Steam. Every day, mill...   \n",
      "3  When a young street hustler, a retired bank ro...   \n",
      "4  Edition Comparison Ultimate Edition The Tom Cl...   \n",
      "\n",
      "                                      about_the_game  \\\n",
      "0  For over two decades, Counter-Strike has offer...   \n",
      "1  LAND, LOOT, SURVIVE! Play PUBG: BATTLEGROUNDS ...   \n",
      "2  The most-played game on Steam. Every day, mill...   \n",
      "3  When a young street hustler, a retired bank ro...   \n",
      "4  “One of the best first-person shooters ever ma...   \n",
      "\n",
      "                                   short_description  \\\n",
      "0  For over two decades, Counter-Strike has offer...   \n",
      "1  Play PUBG: BATTLEGROUNDS for free. Land on str...   \n",
      "2  Every day, millions of players worldwide enter...   \n",
      "3  Grand Theft Auto V for PC offers players the o...   \n",
      "4  Tom Clancy's Rainbow Six® Siege is an elite, t...   \n",
      "\n",
      "                                             reviews  windows    mac  linux  \\\n",
      "0                                                NaN     True  False   True   \n",
      "1                                                NaN     True  False  False   \n",
      "2  “A modern multiplayer masterpiece.” 9.5/10 – D...     True   True   True   \n",
      "3                                                NaN     True  False  False   \n",
      "4                                                NaN     True  False  False   \n",
      "\n",
      "   ...  average_playtime_2weeks  median_playtime_forever  \\\n",
      "0  ...                      879                     5174   \n",
      "1  ...                        0                        0   \n",
      "2  ...                     1536                      898   \n",
      "3  ...                      771                     7101   \n",
      "4  ...                      682                     2434   \n",
      "\n",
      "   median_playtime_2weeks discount peak_ccu  \\\n",
      "0                     350        0  1212356   \n",
      "1                       0        0   616738   \n",
      "2                     892        0   555977   \n",
      "3                      74        0   117698   \n",
      "4                     306       80    89916   \n",
      "\n",
      "                                                tags pct_pos_total  \\\n",
      "0  {'FPS': 90857, 'Shooter': 65397, 'Multiplayer'...            86   \n",
      "1  {'Survival': 14838, 'Shooter': 12727, 'Battle ...            59   \n",
      "2  {'Free to Play': 59933, 'MOBA': 20158, 'Multip...            81   \n",
      "3  {'Open World': 32644, 'Action': 23539, 'Multip...            87   \n",
      "4  {'FPS': 9831, 'PvP': 9162, 'e-sports': 9072, '...            84   \n",
      "\n",
      "  num_reviews_total pct_pos_recent  num_reviews_recent  \n",
      "0           8632939             82               96473  \n",
      "1           2513842             68               16720  \n",
      "2           2452595             80               29366  \n",
      "3           1803832             92               17517  \n",
      "4           1168020             76               12608  \n",
      "\n",
      "[5 rows x 34 columns]\n"
     ]
    }
   ],
   "source": [
    "# appid,name,release_date,required_age,price,dlc_count,detailed_description,about_the_game,short_description,reviews,header_image,website,support_url,support_email,windows,mac,linux,metacritic_score,metacritic_url,achievements,recommendations,notes,supported_languages,full_audio_languages,packages,developers,publishers,categories,genres,screenshots,movies,user_score,score_rank,positive,negative,estimated_owners,average_playtime_forever,average_playtime_2weeks,median_playtime_forever,median_playtime_2weeks,discount,peak_ccu,tags,pct_pos_total,num_reviews_total,pct_pos_recent,num_reviews_recent\n",
    "dataset.drop(['appid', 'name', 'release_date', 'reviews', 'header_image', 'website', 'support_url', 'support_email',\n",
    "              'metacritic_url', 'developers', 'publishers', 'screenshots', 'movies', 'estimated_owners'],\n",
    "              axis=1, inplace=True)\n",
    "print(dataset.head())"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f5436c87",
   "metadata": {},
   "source": [
    "### Structuring Text\n",
    "**TODO: check if this makes sense**\n",
    "\n",
    "The dataset holds a lot of unstructured text, so we use Term Frequency-Inverse Document Frequency (TF-IDF) to turn most text features into numerical vectors.\n",
    "It is important to use a new vectorizer instance for each feature so that their vocabularies don't overlap with each other.\n",
    "\n",
    "### Standardize Values\n",
    "Only the text features are standardized, by removing English stop words. The dataset already provides standardized numerical features."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "4e8b407c",
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.compose import make_column_selector, make_column_transformer\n",
    "from sklearn.feature_extraction.text import TfidfVectorizer\n",
    "# appid,name,release_date,required_age,price,dlc_count,detailed_description,about_the_game,short_description,reviews,header_image,website,support_url,support_email,windows,mac,linux,metacritic_score,metacritic_url,achievements,recommendations,notes,supported_languages,full_audio_languages,packages,developers,publishers,categories,genres,screenshots,movies,user_score,score_rank,positive,negative,estimated_owners,average_playtime_forever,average_playtime_2weeks,median_playtime_forever,median_playtime_2weeks,discount,peak_ccu,tags,pct_pos_total,num_reviews_total,pct_pos_recent,num_reviews_recent\n",
    "\n",
    "# TfidfVectorizer rejects NaN, so missing descriptions become empty strings.\n",
    "text_columns = ['detailed_description', 'about_the_game', 'short_description']\n",
    "dataset[text_columns] = dataset[text_columns].fillna('')\n",
    "\n",
    "# Each text column is selected by name (a string, not a list) so that the\n",
    "# vectorizer receives a 1D column of documents. A one-element list passes a 2D\n",
    "# frame, which is read as a single document and caused the ValueError about\n",
    "# mismatching sizes (1 vs. 9999).\n",
    "column_transformer = make_column_transformer(\n",
    "    (TfidfVectorizer(stop_words='english'), 'detailed_description'),\n",
    "    (TfidfVectorizer(stop_words='english'), 'about_the_game'),\n",
    "    (TfidfVectorizer(stop_words='english'), 'short_description'),\n",
    "    # Only numeric and boolean columns can be stacked next to the sparse TF-IDF output.\n",
    "    ('passthrough', make_column_selector(dtype_include=['number', 'bool'])),\n",
    ")\n",
    "\n",
    "# The result is a sparse matrix, so it is stored separately and `dataset`\n",
    "# stays a DataFrame for the following cells.\n",
    "features = column_transformer.fit_transform(dataset)\n",
    "print(features.shape)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ad84e777",
   "metadata": {},
   "source": [
    "\n",
    "### Removing Bundles\n",
    "**TODO: decide whether to remove bundles; this is less important than initially thought.**\n",
    "\n",
    "Bundles (e.g. publisher bundles) don't have clearly defined genre(s), so they could be removed from the training set."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6a2a3d4f",
   "metadata": {},
   "source": [
    "### Handling missing values\n",
    "Missing numerical values are replaced with the mean of the respective feature. Missing text values are set to the default string `Unknown`. Any rows that still contain NaN values afterwards are dropped."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "dea7dc00",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Setting missing numeric values to the mean\n",
    "dataset.fillna(dataset.mean(numeric_only=True), inplace=True)\n",
    "# Setting missing text values to 'Unknown'\n",
    "dataset.fillna('Unknown', inplace=True)\n",
    "# Dropping any rows that still contain missing values\n",
    "dataset.dropna(inplace=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "091d7e13",
   "metadata": {},
   "source": [
    "# Data Split\n",
    "We split our dataset into training and test data, using 80% of the data for training and 20% for testing."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cfbf3787",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Training: (7999, 33), Testing: (2000, 33)\n"
     ]
    }
   ],
   "source": [
    "from sklearn.model_selection import train_test_split\n",
    "\n",
    "# Setting the target feature 'genres' and dropping it from the dataset\n",
    "X = dataset.drop('genres', axis=1)\n",
    "y = dataset['genres']\n",
    "\n",
    "X_train, X_test, y_train, y_test = train_test_split(\n",
    "    X, y, test_size=0.2, random_state=42\n",
    ")\n",
    "\n",
    "print(f\"Training: {X_train.shape}, Testing: {X_test.shape}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "12b5283d",
   "metadata": {},
   "source": [
    "# Model Selection\n",
    "**TODO Deciding which model to use for this task**"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b7795aa1",
   "metadata": {},
   "source": [
    "### Training\n",
    "**TODO Train the Selected Model with the training data**"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0faa9856",
   "metadata": {},
   "source": [
    "# Evaluation\n",
    "**TODO Test the Model with the test data**"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2aeb6fc2",
   "metadata": {},
   "source": [
    "# Optimization\n",
    "**TODO optimize the model based on the test results**"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "79b20645",
   "metadata": {},
   "source": [
    "# Validation\n",
    "**TODO Predict actual values**"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3b709fb7",
   "metadata": {},
   "source": [
    "# Conclusion and outlook\n",
    "**TODO Write a conclusion and outlook what can be done and where the issues were.**"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "base",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.13.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}