Jupyter notebook was missing some imports

2025-08-13 13:56:30 +02:00
parent 9c3dd33c0b
commit 4b35d4ca21
1 changed files with 224 additions and 72 deletions
--- a/notebook.ipynb
+++ b/notebook.ipynb
@@ -16,12 +16,43 @@
  },
  {
   "cell_type": "code",
+   "execution_count": null,
   "id": "3116b75f",
   "metadata": {
    "jupyter": {
     "is_executing": true
    }
   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "   appid              name release_date  required_age  price  dlc_count  \\\n",
+      "0    730  Counter-Strike 2   2012-08-21             0    0.0          1   \n",
+      "\n",
+      "                                detailed_description  \\\n",
+      "0  For over two decades, Counter-Strike has offer...   \n",
+      "\n",
+      "                                      about_the_game  \\\n",
+      "0  For over two decades, Counter-Strike has offer...   \n",
+      "\n",
+      "                                   short_description reviews  ...  \\\n",
+      "0  For over two decades, Counter-Strike has offer...     NaN  ...   \n",
+      "\n",
+      "  average_playtime_2weeks median_playtime_forever median_playtime_2weeks  \\\n",
+      "0                     879                    5174                    350   \n",
+      "\n",
+      "  discount  peak_ccu                                               tags  \\\n",
+      "0        0   1212356  {'FPS': 90857, 'Shooter': 65397, 'Multiplayer'...   \n",
+      "\n",
+      "   pct_pos_total  num_reviews_total pct_pos_recent  num_reviews_recent  \n",
+      "0             86            8632939             82               96473  \n",
+      "\n",
+      "[1 rows x 47 columns]\n"
+     ]
+    }
+   ],
   "source": [
    "import numpy as np\n",
    "import pandas as pd\n",
@@ -30,10 +61,8 @@
    "set_config(transform_output=\"pandas\")\n",
    "\n",
    "dataset = pd.read_csv(\"./games_march2025_cleaned_2k.csv\",sep=\",\")\n",
-    "print(dataset.head())"
-   ],
-   "outputs": [],
-   "execution_count": null
+    "print(dataset.head(1))"
+   ]
  },
  {
   "cell_type": "markdown",
@@ -60,35 +89,58 @@
   ]
  },
  {
-   "metadata": {},
   "cell_type": "code",
+   "execution_count": null,
+   "id": "d159117377f3633c",
+   "metadata": {},
+   "outputs": [],
   "source": [
    "#dataset.drop(['appid', 'name', 'release_date', 'reviews', 'header_image', 'website', 'support_url', 'support_email', 'metacritic_url', 'notes', 'developers', 'publishers', 'screenshots', 'movies', 'estimated_owners'], axis=1, inplace=True)\n",
    "#print(dataset.head())"
-   ],
-   "id": "d159117377f3633c",
-   "outputs": [],
-   "execution_count": null
+   ]
  },
  {
-   "metadata": {},
   "cell_type": "markdown",
+   "id": "e1b28ddd69f1e9a6",
+   "metadata": {},
   "source": [
    "## Hold onto necessary information\n",
    "Our model should turn a textual description of a game into its genres. For that we need all the textual information a game has, as well as the genres of the game.\n",
    "We use a ColumnTransformer to drop all unnecessary columns, merge all descriptions of a game into one big description, and hold onto the genres.\n",
    "\n",
    "It is important to use ``verbose_feature_names_out=False`` so the output column names are not prefixed with the transformer name."
-   ],
-   "id": "e1b28ddd69f1e9a6"
+   ]
  },
  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "986fbb31a7ae0d8b",
   "metadata": {
    "jupyter": {
     "is_executing": true
    }
   },
-   "cell_type": "code",
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "                                                desc  \\\n",
+      "0  For over two decades, Counter-Strike has offer...   \n",
+      "1  LAND, LOOT, SURVIVE! Play PUBG: BATTLEGROUNDS ...   \n",
+      "2  The most-played game on Steam. Every day, mill...   \n",
+      "3  When a young street hustler, a retired bank ro...   \n",
+      "4  Edition Comparison Ultimate Edition The Tom Cl...   \n",
+      "\n",
+      "                                              genres  \n",
+      "0                         ['Action', 'Free To Play']  \n",
+      "1  ['Action', 'Adventure', 'Massively Multiplayer...  \n",
+      "2             ['Action', 'Strategy', 'Free To Play']  \n",
+      "3                            ['Action', 'Adventure']  \n",
+      "4                                         ['Action']  \n"
+     ]
+    }
+   ],
   "source": [
    "from sklearn.compose import ColumnTransformer\n",
    "from sklearn.preprocessing import FunctionTransformer\n",
@@ -104,24 +156,24 @@
    ")\n",
    "dataset = column_transformer.fit_transform(dataset)\n",
    "print(dataset.head())"
-   ],
-   "id": "986fbb31a7ae0d8b",
-   "outputs": [],
-   "execution_count": null
+   ]
  },
  {
-   "metadata": {},
   "cell_type": "markdown",
+   "id": "f9b89c0645811564",
+   "metadata": {},
   "source": [
    "### Adding missing Information\n",
    "Some games might not have any description. For these we insert an empty string.\n",
    "**TODO: check whether dropna and the numeric_only fillna are needed, as we don't have any numeric columns**"
-   ],
-   "id": "f9b89c0645811564"
+   ]
  },
  {
-   "metadata": {},
   "cell_type": "code",
+   "execution_count": null,
+   "id": "44239f6b7fd23cde",
+   "metadata": {},
+   "outputs": [],
   "source": [
    "# missing numeric values => mean\n",
    "dataset.fillna(dataset.mean(numeric_only=True), inplace=True)\n",
@@ -129,49 +181,82 @@
    "dataset.fillna('', inplace=True)\n",
    "# drop all rows that still have missing values\n",
    "dataset.dropna(inplace=True)"
-   ],
-   "id": "44239f6b7fd23cde",
-   "outputs": [],
-   "execution_count": null
+   ]
  },
  {
-   "metadata": {},
   "cell_type": "markdown",
+   "id": "ca5b59b9fa8160a0",
+   "metadata": {},
   "source": [
    "## Transform Genres\n",
    "The genre information is currently a string holding a Python list of genres. While this is machine-readable, we need One-Hot-Encoding for our model to work.\n",
    "\n",
    "#### Deserializing the String-Array\n",
    "The \"ast\" module can safely evaluate a string containing a Python literal via ``ast.literal_eval``, and as such will be used for parsing the genre strings into lists."
-   ],
-   "id": "ca5b59b9fa8160a0"
+   ]
  },
  {
-   "metadata": {},
   "cell_type": "code",
+   "execution_count": null,
+   "id": "ebc5a24e9bc87fdd",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "0                               [Action, Free To Play]\n",
+      "1    [Action, Adventure, Massively Multiplayer, Fre...\n",
+      "2                     [Action, Strategy, Free To Play]\n",
+      "3                                  [Action, Adventure]\n",
+      "4                                             [Action]\n",
+      "Name: genres, dtype: object\n"
+     ]
+    }
+   ],
   "source": [
    "import ast\n",
    "\n",
    "dataset['genres'] = dataset['genres'].map(ast.literal_eval)\n",
-    "print(dataset['genres'])"
-   ],
-   "id": "ebc5a24e9bc87fdd",
-   "outputs": [],
-   "execution_count": null
+    "print(dataset['genres'].head())"
+   ]
  },
  {
-   "metadata": {},
   "cell_type": "markdown",
+   "id": "f90756f9ad9211f4",
+   "metadata": {},
   "source": [
    "#### One-Hot-Encoding a Python-Array\n",
    "The sklearn ``OneHotEncoder()`` expects exactly one class per datapoint, such as one of ``['Politics', 'Sport', 'Culture']``.\n",
    "Steam allows an app/bundle to have multiple genres. As such, each datapoint in our dataset holds a list of classes, which sklearn's ``MultiLabelBinarizer()`` does support."
-   ],
-   "id": "f90756f9ad9211f4"
+   ]
  },
  {
-   "metadata": {},
   "cell_type": "code",
+   "execution_count": null,
+   "id": "d2c3527a5fc876bf",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "   Action  Adventure  Casual  Early Access  Free To Play  Gore  Indie  \\\n",
+      "0       1          0       0             0             1     0      0   \n",
+      "1       1          1       0             0             1     0      0   \n",
+      "2       1          0       0             0             1     0      0   \n",
+      "3       1          1       0             0             0     0      0   \n",
+      "4       1          0       0             0             0     0      0   \n",
+      "\n",
+      "   Massively Multiplayer  RPG  Racing  Simulation  Sports  Strategy  Violent  \n",
+      "0                      0    0       0           0       0         0        0  \n",
+      "1                      1    0       0           0       0         0        0  \n",
+      "2                      0    0       0           0       0         1        0  \n",
+      "3                      0    0       0           0       0         0        0  \n",
+      "4                      0    0       0           0       0         0        0  \n"
+     ]
+    }
+   ],
   "source": [
    "from sklearn.preprocessing import MultiLabelBinarizer\n",
    "\n",
@@ -179,16 +264,15 @@
    "genres_encoded = mlb_genres.fit_transform(dataset.pop('genres'))\n",
    "genres_df = pd.DataFrame(genres_encoded, columns=mlb_genres.classes_)\n",
    "print(genres_df.head())"
-   ],
-   "id": "d2c3527a5fc876bf",
-   "outputs": [],
-   "execution_count": null
+   ]
  },
  {
-   "metadata": {},
   "cell_type": "markdown",
-   "source": "With this, our target matrix is completed.",
-   "id": "671c01f9f4ae66d9"
+   "id": "671c01f9f4ae66d9",
+   "metadata": {},
+   "source": [
+    "With this, our target matrix is completed."
+   ]
  },
  {
   "cell_type": "markdown",
@@ -201,8 +285,32 @@
  },
  {
   "cell_type": "code",
+   "execution_count": null,
   "id": "4e8b407c",
   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "    00  000  000km    000th  00am  00f  00i  00p  00v   01  ...  이터널  이터널리턴  \\\n",
+      "0  0.0  0.0    0.0  0.00000   0.0  0.0  0.0  0.0  0.0  0.0  ...  0.0    0.0   \n",
+      "1  0.0  0.0    0.0  0.00000   0.0  0.0  0.0  0.0  0.0  0.0  ...  0.0    0.0   \n",
+      "2  0.0  0.0    0.0  0.14649   0.0  0.0  0.0  0.0  0.0  0.0  ...  0.0    0.0   \n",
+      "3  0.0  0.0    0.0  0.00000   0.0  0.0  0.0  0.0  0.0  0.0  ...  0.0    0.0   \n",
+      "4  0.0  0.0    0.0  0.00000   0.0  0.0  0.0  0.0  0.0  0.0  ...  0.0    0.0   \n",
+      "\n",
+      "   이현준  정대찬  중입니다   철권  토탈워  페르소나  한국어  한글을  \n",
+      "0  0.0  0.0   0.0  0.0  0.0   0.0  0.0  0.0  \n",
+      "1  0.0  0.0   0.0  0.0  0.0   0.0  0.0  0.0  \n",
+      "2  0.0  0.0   0.0  0.0  0.0   0.0  0.0  0.0  \n",
+      "3  0.0  0.0   0.0  0.0  0.0   0.0  0.0  0.0  \n",
+      "4  0.0  0.0   0.0  0.0  0.0   0.0  0.0  0.0  \n",
+      "\n",
+      "[5 rows x 29351 columns]\n"
+     ]
+    }
+   ],
   "source": [
    "from sklearn.feature_extraction.text import TfidfVectorizer\n",
    "\n",
@@ -210,52 +318,60 @@
    "tfidf_matrix = vectorizer.fit_transform(dataset['desc']) # matrix, not pandas df\n",
    "tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), columns=vectorizer.get_feature_names_out())\n",
    "print(tfidf_df.head())"
-   ],
-   "outputs": [],
-   "execution_count": null
+   ]
  },
  {
   "cell_type": "markdown",
   "id": "ad84e777",
   "metadata": {},
-   "source": "With this our feature matrix is completed"
+   "source": [
+    "With this our feature matrix is completed"
+   ]
  },
  {
-   "metadata": {},
   "cell_type": "code",
-   "outputs": [],
   "execution_count": null,
+   "id": "86d9da42f4df8e49",
+   "metadata": {},
+   "outputs": [],
   "source": [
    "X = tfidf_df\n",
    "y = genres_df"
-   ],
-   "id": "86d9da42f4df8e49"
+   ]
  },
  {
-   "metadata": {},
   "cell_type": "markdown",
+   "id": "aeb782668f311cd8",
+   "metadata": {},
   "source": [
    "## The Model\n",
    "\n",
    "#### Removing unpredictable Datapoints\n",
    "Some datapoints don't have a genre assigned (all target values in y are 0). The model we use can't handle such cases, thus they have to be removed.\n",
    "We build a boolean mask that selects every row with at least one genre, and apply that mask to both matrices."
-   ],
-   "id": "aeb782668f311cd8"
+   ]
  },
  {
-   "metadata": {},
   "cell_type": "code",
-   "outputs": [],
   "execution_count": null,
+   "id": "4919bf1b37d171a7",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "13\n"
+     ]
+    }
+   ],
   "source": [
    "mask = y.sum(axis=1) > 0  # True for rows with at least one genre\n",
    "print((~mask).sum()) # count of unpredictable datapoints\n",
    "\n",
    "X_clean = X[mask]\n",
    "y_clean = y[mask]"
-   ],
-   "id": "4919bf1b37d171a7"
+   ]
  },
  {
   "cell_type": "markdown",
@@ -269,19 +385,19 @@
  },
  {
   "cell_type": "code",
+   "execution_count": null,
   "id": "cfbf3787",
   "metadata": {
    "jupyter": {
     "is_executing": true
    }
   },
+   "outputs": [],
   "source": [
    "from sklearn.model_selection import train_test_split\n",
    "\n",
    "X_train, X_test, y_train, y_test = train_test_split(X_clean, y_clean, random_state=0)"
-   ],
-   "outputs": [],
-   "execution_count": null
+   ]
  },
  {
   "cell_type": "markdown",
@@ -295,19 +411,22 @@
   ]
  },
  {
-   "metadata": {},
   "cell_type": "code",
-   "outputs": [],
   "execution_count": null,
+   "id": "8c1d72c4532bd509",
+   "metadata": {},
+   "outputs": [],
   "source": [
-    "# n_jobs=1 since there seems to be some multithreading join issue in sklearn (or my pc is too bad)\n",
+    "from sklearn.linear_model import LogisticRegression\n",
+    "from sklearn.multioutput import MultiOutputClassifier\n",
+    "\n",
+    "# n_jobs=1 since there seems to be some multithreading join issue in sklearn (or my pc is too bad)\n",
    "multi_target_clf = MultiOutputClassifier(LogisticRegression(max_iter=1337, random_state=0), n_jobs=1)\n",
    "\n",
    "multi_target_clf.fit(X_train, y_train)\n",
    "\n",
    "y_pred = multi_target_clf.predict(X_test)"
-   ],
-   "id": "8c1d72c4532bd509"
+   ]
  },
  {
   "cell_type": "markdown",
@@ -319,12 +438,45 @@
   ]
  },
  {
-   "metadata": {},
   "cell_type": "code",
-   "outputs": [],
   "execution_count": null,
-   "source": "print(classification_report(y_test, y_pred, zero_division=0.0))",
-   "id": "e2ebea6945193e07"
+   "id": "e2ebea6945193e07",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "              precision    recall  f1-score   support\n",
+      "\n",
+      "           0       0.78      0.91      0.84       300\n",
+      "           1       0.78      0.62      0.69       216\n",
+      "           2       1.00      0.03      0.07        86\n",
+      "           3       0.00      0.00      0.00        46\n",
+      "           4       1.00      0.04      0.07        83\n",
+      "           5       0.00      0.00      0.00         0\n",
+      "           6       0.79      0.81      0.80       245\n",
+      "           7       0.00      0.00      0.00        42\n",
+      "           8       0.90      0.34      0.49       127\n",
+      "           9       0.00      0.00      0.00        12\n",
+      "          10       0.89      0.25      0.39       127\n",
+      "          11       0.00      0.00      0.00        14\n",
+      "          12       0.88      0.14      0.24       106\n",
+      "          13       0.00      0.00      0.00         0\n",
+      "\n",
+      "   micro avg       0.79      0.50      0.61      1404\n",
+      "   macro avg       0.50      0.22      0.26      1404\n",
+      "weighted avg       0.77      0.50      0.53      1404\n",
+      " samples avg       0.77      0.56      0.60      1404\n",
+      "\n"
+     ]
+    }
+   ],
+   "source": [
+    "from sklearn.metrics import classification_report\n",
+    "\n",
+    "print(classification_report(y_test, y_pred, zero_division=0.0))"
+   ]
  },
  {
   "cell_type": "markdown",