jupyter notebook

Tim
2025-08-12 19:09:53 +02:00
parent ac39214e0d
commit 9c3dd33c0b
3 changed files with 226 additions and 261 deletions

@@ -7,103 +7,42 @@
"source": [
"# Machine Learning project in SoSe 2025 at HTW Saar\n",
"## Idea\n",
"The goal of this project is getting the genre(s) of a game trough its given metadata\n",
"The goal of this project is predicting the genre(s) of a game/bundle through its given description(s)\n",
"\n",
"## Dataset\n",
"For our project we use a Steam dataSet from kaggle. You can find it under the following URL: [Kaggle.com](https://www.kaggle.com/datasets/artermiloff/steam-games-dataset/data)\n",
"\n",
"### Importing the dataSet\n",
"The dataSet is imported and added as a variable."
"For our project we use a Steam Dataset provided on moodle, since it has all information we plan on using.\n",
"The Dataset has been cut to only 2000 data points to be runnable on weaker devices."
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "3116b75f",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" appid name release_date required_age price \\\n",
"0 730 Counter-Strike 2 2012-08-21 0 0.00 \n",
"1 578080 PUBG: BATTLEGROUNDS 2017-12-21 0 0.00 \n",
"2 570 Dota 2 2013-07-09 0 0.00 \n",
"3 271590 Grand Theft Auto V Legacy 2015-04-13 17 0.00 \n",
"4 359550 Tom Clancy's Rainbow Six® Siege 2015-12-01 17 3.99 \n",
"\n",
" dlc_count detailed_description \\\n",
"0 1 For over two decades, Counter-Strike has offer... \n",
"1 0 LAND, LOOT, SURVIVE! Play PUBG: BATTLEGROUNDS ... \n",
"2 2 The most-played game on Steam. Every day, mill... \n",
"3 0 When a young street hustler, a retired bank ro... \n",
"4 9 Edition Comparison Ultimate Edition The Tom Cl... \n",
"\n",
" about_the_game \\\n",
"0 For over two decades, Counter-Strike has offer... \n",
"1 LAND, LOOT, SURVIVE! Play PUBG: BATTLEGROUNDS ... \n",
"2 The most-played game on Steam. Every day, mill... \n",
"3 When a young street hustler, a retired bank ro... \n",
"4 “One of the best first-person shooters ever ma... \n",
"\n",
" short_description \\\n",
"0 For over two decades, Counter-Strike has offer... \n",
"1 Play PUBG: BATTLEGROUNDS for free. Land on str... \n",
"2 Every day, millions of players worldwide enter... \n",
"3 Grand Theft Auto V for PC offers players the o... \n",
"4 Tom Clancy's Rainbow Six® Siege is an elite, t... \n",
"\n",
" reviews ... \\\n",
"0 NaN ... \n",
"1 NaN ... \n",
"2 “A modern multiplayer masterpiece.” 9.5/10 D... ... \n",
"3 NaN ... \n",
"4 NaN ... \n",
"\n",
" average_playtime_2weeks median_playtime_forever median_playtime_2weeks \\\n",
"0 879 5174 350 \n",
"1 0 0 0 \n",
"2 1536 898 892 \n",
"3 771 7101 74 \n",
"4 682 2434 306 \n",
"\n",
" discount peak_ccu tags \\\n",
"0 0 1212356 {'FPS': 90857, 'Shooter': 65397, 'Multiplayer'... \n",
"1 0 616738 {'Survival': 14838, 'Shooter': 12727, 'Battle ... \n",
"2 0 555977 {'Free to Play': 59933, 'MOBA': 20158, 'Multip... \n",
"3 0 117698 {'Open World': 32644, 'Action': 23539, 'Multip... \n",
"4 80 89916 {'FPS': 9831, 'PvP': 9162, 'e-sports': 9072, '... \n",
"\n",
" pct_pos_total num_reviews_total pct_pos_recent num_reviews_recent \n",
"0 86 8632939 82 96473 \n",
"1 59 2513842 68 16720 \n",
"2 81 2452595 80 29366 \n",
"3 87 1803832 92 17517 \n",
"4 84 1168020 76 12608 \n",
"\n",
"[5 rows x 47 columns]\n"
]
"metadata": {
"jupyter": {
"is_executing": true
}
],
},
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"from sklearn import set_config\n",
"\n",
"# load data\n",
"# appid,name,release_date,required_age,price,dlc_count,detailed_description,about_the_game,short_description,reviews,header_image,website,support_url,support_email,windows,mac,linux,metacritic_score,metacritic_url,achievements,recommendations,notes,supported_languages,full_audio_languages,packages,developers,publishers,categories,genres,screenshots,movies,user_score,score_rank,positive,negative,estimated_owners,average_playtime_forever,average_playtime_2weeks,median_playtime_forever,median_playtime_2weeks,discount,peak_ccu,tags,pct_pos_total,num_reviews_total,pct_pos_recent,num_reviews_recent\n",
"dataset = pd.read_csv(\"./games_march2025_cleaned_10k.csv\",sep=\",\")\n",
"set_config(transform_output=\"pandas\")\n",
"\n",
"dataset = pd.read_csv(\"./games_march2025_cleaned_2k.csv\",sep=\",\")\n",
"print(dataset.head())"
]
],
"outputs": [],
"execution_count": null
},
{
"cell_type": "markdown",
"id": "cba9750a",
"metadata": {},
"source": [
"## Preparation of the Training-Set\n",
"## Preparation of the Dataset\n",
"### Removing Uniques\n",
"We remove the following features from the Training-Set as they can uniquely identify a datapoint:\n",
"We would remove the following features from the Training-Set as they can/could uniquely identify a datapoint, but we don't as they will be removed in the next step anyway\n",
"- AppId\n",
"- Name of the Game\n",
"- Realease Date\n",
@@ -121,213 +60,228 @@
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "06dedcdf",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" required_age price dlc_count \\\n",
"0 0 0.00 1 \n",
"1 0 0.00 0 \n",
"2 0 0.00 2 \n",
"3 17 0.00 0 \n",
"4 17 3.99 9 \n",
"\n",
" detailed_description \\\n",
"0 For over two decades, Counter-Strike has offer... \n",
"1 LAND, LOOT, SURVIVE! Play PUBG: BATTLEGROUNDS ... \n",
"2 The most-played game on Steam. Every day, mill... \n",
"3 When a young street hustler, a retired bank ro... \n",
"4 Edition Comparison Ultimate Edition The Tom Cl... \n",
"\n",
" about_the_game \\\n",
"0 For over two decades, Counter-Strike has offer... \n",
"1 LAND, LOOT, SURVIVE! Play PUBG: BATTLEGROUNDS ... \n",
"2 The most-played game on Steam. Every day, mill... \n",
"3 When a young street hustler, a retired bank ro... \n",
"4 “One of the best first-person shooters ever ma... \n",
"\n",
" short_description \\\n",
"0 For over two decades, Counter-Strike has offer... \n",
"1 Play PUBG: BATTLEGROUNDS for free. Land on str... \n",
"2 Every day, millions of players worldwide enter... \n",
"3 Grand Theft Auto V for PC offers players the o... \n",
"4 Tom Clancy's Rainbow Six® Siege is an elite, t... \n",
"\n",
" reviews windows mac linux \\\n",
"0 NaN True False True \n",
"1 NaN True False False \n",
"2 “A modern multiplayer masterpiece.” 9.5/10 D... True True True \n",
"3 NaN True False False \n",
"4 NaN True False False \n",
"\n",
" ... average_playtime_2weeks median_playtime_forever \\\n",
"0 ... 879 5174 \n",
"1 ... 0 0 \n",
"2 ... 1536 898 \n",
"3 ... 771 7101 \n",
"4 ... 682 2434 \n",
"\n",
" median_playtime_2weeks discount peak_ccu \\\n",
"0 350 0 1212356 \n",
"1 0 0 616738 \n",
"2 892 0 555977 \n",
"3 74 0 117698 \n",
"4 306 80 89916 \n",
"\n",
" tags pct_pos_total \\\n",
"0 {'FPS': 90857, 'Shooter': 65397, 'Multiplayer'... 86 \n",
"1 {'Survival': 14838, 'Shooter': 12727, 'Battle ... 59 \n",
"2 {'Free to Play': 59933, 'MOBA': 20158, 'Multip... 81 \n",
"3 {'Open World': 32644, 'Action': 23539, 'Multip... 87 \n",
"4 {'FPS': 9831, 'PvP': 9162, 'e-sports': 9072, '... 84 \n",
"\n",
" num_reviews_total pct_pos_recent num_reviews_recent \n",
"0 8632939 82 96473 \n",
"1 2513842 68 16720 \n",
"2 2452595 80 29366 \n",
"3 1803832 92 17517 \n",
"4 1168020 76 12608 \n",
"\n",
"[5 rows x 34 columns]\n"
]
}
],
"cell_type": "code",
"source": [
"# appid,name,release_date,required_age,price,dlc_count,detailed_description,about_the_game,short_description,reviews,header_image,website,support_url,support_email,windows,mac,linux,metacritic_score,metacritic_url,achievements,recommendations,notes,supported_languages,full_audio_languages,packages,developers,publishers,categories,genres,screenshots,movies,user_score,score_rank,positive,negative,estimated_owners,average_playtime_forever,average_playtime_2weeks,median_playtime_forever,median_playtime_2weeks,discount,peak_ccu,tags,pct_pos_total,num_reviews_total,pct_pos_recent,num_reviews_recent\n",
"dataset.drop(['appid', 'name', 'release_date', 'reviews', 'header_image', 'website', 'support_url', 'support_email',\n",
" 'metacritic_url', 'developers', 'publishers', 'screenshots', 'movies', 'estimated_owners'],\n",
" axis=1, inplace=True)\n",
"#dataset.drop(['appid', 'name', 'release_date', 'reviews', 'header_image', 'website', 'support_url', 'support_email', 'metacritic_url', 'notes', 'developers', 'publishers', 'screenshots', 'movies', 'estimated_owners'], axis=1, inplace=True)\n",
"#print(dataset.head())"
],
"id": "d159117377f3633c",
"outputs": [],
"execution_count": null
},
{
"metadata": {},
"cell_type": "markdown",
"source": [
"## Hold onto necessary information\n",
"Our model should turn a textual description of a game into its genre. For that we need all the textual information a game has, as well as the genres of the game.\n",
"We use a ColumnTransformer to drop all unnecessary lines, merge all descriptions of a game into one big description and hold onto the genres\n",
"\n",
"It is important to use ``verbose_feature_names_out=False`` so the feature names don't get changed"
],
"id": "e1b28ddd69f1e9a6"
},
{
"metadata": {
"jupyter": {
"is_executing": true
}
},
"cell_type": "code",
"source": [
"from sklearn.compose import ColumnTransformer\n",
"from sklearn.preprocessing import FunctionTransformer\n",
"\n",
"# desc, genres\n",
"column_transformer = ColumnTransformer([\n",
" # merge all descriptions\n",
" ('desc', FunctionTransformer(lambda X: X.fillna('').agg(' '.join, axis=1).to_frame(name=\"desc\")),\n",
" ['detailed_description', 'about_the_game', 'short_description']),\n",
" ('pass', 'passthrough', ['genres']),\n",
" ],\n",
" verbose_feature_names_out=False\n",
")\n",
"dataset = column_transformer.fit_transform(dataset)\n",
"print(dataset.head())"
]
],
"id": "986fbb31a7ae0d8b",
"outputs": [],
"execution_count": null
},
{
"metadata": {},
"cell_type": "markdown",
"source": [
"### Adding missing Information\n",
"Some Games might not have any descriptions. For these we Input an Empty String\n",
"**TODO: check if dropna and fillna numeric_only is needed, as we dont have any numbers**"
],
"id": "f9b89c0645811564"
},
{
"metadata": {},
"cell_type": "code",
"source": [
"# missing numeric values => mean\n",
"dataset.fillna(dataset.mean(numeric_only=True), inplace=True)\n",
"# missing strings => empty string?\n",
"dataset.fillna('', inplace=True)\n",
"# drop all lines with missing values\n",
"dataset.dropna(inplace=True)"
],
"id": "44239f6b7fd23cde",
"outputs": [],
"execution_count": null
},
{
"metadata": {},
"cell_type": "markdown",
"source": [
"## Transform Genres\n",
"The genre information currently is a string holding a python array of genres. While this is machine-readable, we need One-Hot-Encoding for our model to work.\n",
"\n",
"#### Serializing the String-Array\n",
"The \"ast\" library can interpret python strings as python code, and as such will be used for serializing the genres."
],
"id": "ca5b59b9fa8160a0"
},
{
"metadata": {},
"cell_type": "code",
"source": [
"import ast\n",
"\n",
"dataset['genres'] = dataset['genres'].map(lambda s: ast.literal_eval(s))\n",
"print(dataset['genres'])"
],
"id": "ebc5a24e9bc87fdd",
"outputs": [],
"execution_count": null
},
{
"metadata": {},
"cell_type": "markdown",
"source": [
"#### One-Hot-Encoding an Python-Array\n",
"The sklearn ``OneHotEncoder()`` is only able to work with an 1D Array of different classes, such as ``['Politics', 'Sport', 'Culture']``. Every datapoint can only have one concurrent classification.\n",
"Steam allows an app/bundle to have multiple genres. As such, our dataset has an 2D Array of different classes, which sklearn's ``MultiLabelBinarizer()`` does support."
],
"id": "f90756f9ad9211f4"
},
{
"metadata": {},
"cell_type": "code",
"source": [
"from sklearn.preprocessing import MultiLabelBinarizer\n",
"\n",
"mlb_genres = MultiLabelBinarizer()\n",
"genres_encoded = mlb_genres.fit_transform(dataset.pop('genres'))\n",
"genres_df = pd.DataFrame(genres_encoded, columns=mlb_genres.classes_)\n",
"print(genres_df.head())"
],
"id": "d2c3527a5fc876bf",
"outputs": [],
"execution_count": null
},
{
"metadata": {},
"cell_type": "markdown",
"source": "With this, our target matrix is completed.",
"id": "671c01f9f4ae66d9"
},
{
"cell_type": "markdown",
"id": "f5436c87",
"metadata": {},
"source": [
"### Structurize Text\n",
"**TODO: check if makes sense**\n",
"The dataset holds a lot of unstructured data, we use Term Frequency-Inverse Document Frequency to structurize most Text-Features.\n",
"It is important to use an new Instance for each feature so they don't overlap with each other. \n",
"\n",
"### Standardize Values\n",
"We standardize only the text features to remove the stop words. The dataset allready provides standardized numerical features."
"### Structurizing Text\n",
"If we want our Model to be able to use text as an input, we have to vectorize the text. TF-IDF (Inverse Document Frequency) is an easy way of transforming each word into a feature with a 0 to 1 value. **TODO: filter out stopwords**"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "4e8b407c",
"metadata": {},
"outputs": [
{
"ename": "ValueError",
"evalue": "all the input array dimensions except for the concatenation axis must match exactly, but along dimension 0, the array at index 0 has size 1 and the array at index 3 has size 9999",
"output_type": "error",
"traceback": [
"\u001b[31m---------------------------------------------------------------------------\u001b[39m",
"\u001b[31mValueError\u001b[39m Traceback (most recent call last)",
"\u001b[36mCell\u001b[39m\u001b[36m \u001b[39m\u001b[32mIn[9]\u001b[39m\u001b[32m, line 11\u001b[39m\n\u001b[32m 3\u001b[39m \u001b[38;5;66;03m# appid,name,release_date,required_age,price,dlc_count,detailed_description,about_the_game,short_description,reviews,header_image,website,support_url,support_email,windows,mac,linux,metacritic_score,metacritic_url,achievements,recommendations,notes,supported_languages,full_audio_languages,packages,developers,publishers,categories,genres,screenshots,movies,user_score,score_rank,positive,negative,estimated_owners,average_playtime_forever,average_playtime_2weeks,median_playtime_forever,median_playtime_2weeks,discount,peak_ccu,tags,pct_pos_total,num_reviews_total,pct_pos_recent,num_reviews_recent\u001b[39;00m\n\u001b[32m 4\u001b[39m column_transformer = make_column_transformer(\n\u001b[32m 5\u001b[39m (TfidfVectorizer(stop_words=\u001b[33m'\u001b[39m\u001b[33menglish\u001b[39m\u001b[33m'\u001b[39m), [\u001b[33m'\u001b[39m\u001b[33mdetailed_description\u001b[39m\u001b[33m'\u001b[39m]),\n\u001b[32m 6\u001b[39m (TfidfVectorizer(stop_words=\u001b[33m'\u001b[39m\u001b[33menglish\u001b[39m\u001b[33m'\u001b[39m), [\u001b[33m'\u001b[39m\u001b[33mabout_the_game\u001b[39m\u001b[33m'\u001b[39m]),\n\u001b[32m 7\u001b[39m (TfidfVectorizer(stop_words=\u001b[33m'\u001b[39m\u001b[33menglish\u001b[39m\u001b[33m'\u001b[39m), [\u001b[33m'\u001b[39m\u001b[33mshort_description\u001b[39m\u001b[33m'\u001b[39m]),\n\u001b[32m 8\u001b[39m (\u001b[33m'\u001b[39m\u001b[33mpassthrough\u001b[39m\u001b[33m'\u001b[39m, 
[\u001b[33m'\u001b[39m\u001b[33mrequired_age\u001b[39m\u001b[33m'\u001b[39m,\u001b[33m'\u001b[39m\u001b[33mprice\u001b[39m\u001b[33m'\u001b[39m,\u001b[33m'\u001b[39m\u001b[33mdlc_count\u001b[39m\u001b[33m'\u001b[39m,\u001b[33m'\u001b[39m\u001b[33mreviews\u001b[39m\u001b[33m'\u001b[39m,\u001b[33m'\u001b[39m\u001b[33mwindows\u001b[39m\u001b[33m'\u001b[39m,\u001b[33m'\u001b[39m\u001b[33mmac\u001b[39m\u001b[33m'\u001b[39m,\u001b[33m'\u001b[39m\u001b[33mlinux\u001b[39m\u001b[33m'\u001b[39m,\u001b[33m'\u001b[39m\u001b[33mmetacritic_score\u001b[39m\u001b[33m'\u001b[39m,\u001b[33m'\u001b[39m\u001b[33machievements\u001b[39m\u001b[33m'\u001b[39m,\u001b[33m'\u001b[39m\u001b[33mrecommendations\u001b[39m\u001b[33m'\u001b[39m,\u001b[33m'\u001b[39m\u001b[33mnotes\u001b[39m\u001b[33m'\u001b[39m,\u001b[33m'\u001b[39m\u001b[33msupported_languages\u001b[39m\u001b[33m'\u001b[39m,\u001b[33m'\u001b[39m\u001b[33mfull_audio_languages\u001b[39m\u001b[33m'\u001b[39m,\u001b[33m'\u001b[39m\u001b[33mcategories\u001b[39m\u001b[33m'\u001b[39m,\u001b[33m'\u001b[39m\u001b[33mgenres\u001b[39m\u001b[33m'\u001b[39m,\u001b[33m'\u001b[39m\u001b[33muser_score\u001b[39m\u001b[33m'\u001b[39m,\u001b[33m'\u001b[39m\u001b[33mscore_rank\u001b[39m\u001b[33m'\u001b[39m,\u001b[33m'\u001b[39m\u001b[33mpositive\u001b[39m\u001b[33m'\u001b[39m,\u001b[33m'\u001b[39m\u001b[33mnegative\u001b[39m\u001b[33m'\u001b[39m,\u001b[33m'\u001b[39m\u001b[33maverage_playtime_forever\u001b[39m\u001b[33m'\u001b[39m,\u001b[33m'\u001b[39m\u001b[33maverage_playtime_2weeks\u001b[39m\u001b[33m'\u001b[39m,\u001b[33m'\u001b[39m\u001b[33mmedian_playtime_forever\u001b[39m\u001b[33m'\u001b[39m,\u001b[33m'\u001b[39m\u001b[33mmedian_playtime_2weeks\u001b[39m\u001b[33m'\u001b[39m,\u001b[33m'\u001b[39m\u001b[33mdiscount\u001b[39m\u001b[33m'\u001b[39m,\u001b[33m'\u001b[39m\u001b[33mpeak_ccu\u001b[39m\u001b[33m'\u001b[39m,\u001b[33m'\u001b[39m\u001b[33mtags\u001b[39m\u001b[33m'\u001b[39m,\u001b[33m'\u001b[39m\u001b[33mpct_pos_total\u001b[39m\u001b[3
3m'\u001b[39m,\u001b[33m'\u001b[39m\u001b[33mnum_reviews_total\u001b[39m\u001b[33m'\u001b[39m,\u001b[33m'\u001b[39m\u001b[33mpct_pos_recent\u001b[39m\u001b[33m'\u001b[39m,\u001b[33m'\u001b[39m\u001b[33mnum_reviews_recent\u001b[39m\u001b[33m'\u001b[39m])\n\u001b[32m 9\u001b[39m )\n\u001b[32m---> \u001b[39m\u001b[32m11\u001b[39m dataset = \u001b[43mcolumn_transformer\u001b[49m\u001b[43m.\u001b[49m\u001b[43mfit_transform\u001b[49m\u001b[43m(\u001b[49m\u001b[43mdataset\u001b[49m\u001b[43m)\u001b[49m\n\u001b[32m 12\u001b[39m \u001b[38;5;28mprint\u001b[39m(dataset.head())\n",
"\u001b[36mFile \u001b[39m\u001b[32mc:\\Users\\FlorianSpeicher\\anaconda3\\Lib\\site-packages\\sklearn\\utils\\_set_output.py:319\u001b[39m, in \u001b[36m_wrap_method_output.<locals>.wrapped\u001b[39m\u001b[34m(self, X, *args, **kwargs)\u001b[39m\n\u001b[32m 317\u001b[39m \u001b[38;5;129m@wraps\u001b[39m(f)\n\u001b[32m 318\u001b[39m \u001b[38;5;28;01mdef\u001b[39;00m\u001b[38;5;250m \u001b[39m\u001b[34mwrapped\u001b[39m(\u001b[38;5;28mself\u001b[39m, X, *args, **kwargs):\n\u001b[32m--> \u001b[39m\u001b[32m319\u001b[39m data_to_wrap = \u001b[43mf\u001b[49m\u001b[43m(\u001b[49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mX\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43m*\u001b[49m\u001b[43margs\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43m*\u001b[49m\u001b[43m*\u001b[49m\u001b[43mkwargs\u001b[49m\u001b[43m)\u001b[49m\n\u001b[32m 320\u001b[39m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;28misinstance\u001b[39m(data_to_wrap, \u001b[38;5;28mtuple\u001b[39m):\n\u001b[32m 321\u001b[39m \u001b[38;5;66;03m# only wrap the first output for cross decomposition\u001b[39;00m\n\u001b[32m 322\u001b[39m return_tuple = (\n\u001b[32m 323\u001b[39m _wrap_data_with_container(method, data_to_wrap[\u001b[32m0\u001b[39m], X, \u001b[38;5;28mself\u001b[39m),\n\u001b[32m 324\u001b[39m *data_to_wrap[\u001b[32m1\u001b[39m:],\n\u001b[32m 325\u001b[39m )\n",
"\u001b[36mFile \u001b[39m\u001b[32mc:\\Users\\FlorianSpeicher\\anaconda3\\Lib\\site-packages\\sklearn\\base.py:1389\u001b[39m, in \u001b[36m_fit_context.<locals>.decorator.<locals>.wrapper\u001b[39m\u001b[34m(estimator, *args, **kwargs)\u001b[39m\n\u001b[32m 1382\u001b[39m estimator._validate_params()\n\u001b[32m 1384\u001b[39m \u001b[38;5;28;01mwith\u001b[39;00m config_context(\n\u001b[32m 1385\u001b[39m skip_parameter_validation=(\n\u001b[32m 1386\u001b[39m prefer_skip_nested_validation \u001b[38;5;129;01mor\u001b[39;00m global_skip_validation\n\u001b[32m 1387\u001b[39m )\n\u001b[32m 1388\u001b[39m ):\n\u001b[32m-> \u001b[39m\u001b[32m1389\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[43mfit_method\u001b[49m\u001b[43m(\u001b[49m\u001b[43mestimator\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43m*\u001b[49m\u001b[43margs\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43m*\u001b[49m\u001b[43m*\u001b[49m\u001b[43mkwargs\u001b[49m\u001b[43m)\u001b[49m\n",
"\u001b[36mFile \u001b[39m\u001b[32mc:\\Users\\FlorianSpeicher\\anaconda3\\Lib\\site-packages\\sklearn\\compose\\_column_transformer.py:1031\u001b[39m, in \u001b[36mColumnTransformer.fit_transform\u001b[39m\u001b[34m(self, X, y, **params)\u001b[39m\n\u001b[32m 1028\u001b[39m \u001b[38;5;28mself\u001b[39m._validate_output(Xs)\n\u001b[32m 1029\u001b[39m \u001b[38;5;28mself\u001b[39m._record_output_indices(Xs)\n\u001b[32m-> \u001b[39m\u001b[32m1031\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28;43mself\u001b[39;49m\u001b[43m.\u001b[49m\u001b[43m_hstack\u001b[49m\u001b[43m(\u001b[49m\u001b[38;5;28;43mlist\u001b[39;49m\u001b[43m(\u001b[49m\u001b[43mXs\u001b[49m\u001b[43m)\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mn_samples\u001b[49m\u001b[43m=\u001b[49m\u001b[43mn_samples\u001b[49m\u001b[43m)\u001b[49m\n",
"\u001b[36mFile \u001b[39m\u001b[32mc:\\Users\\FlorianSpeicher\\anaconda3\\Lib\\site-packages\\sklearn\\compose\\_column_transformer.py:1225\u001b[39m, in \u001b[36mColumnTransformer._hstack\u001b[39m\u001b[34m(self, Xs, n_samples)\u001b[39m\n\u001b[32m 1215\u001b[39m \u001b[38;5;28;01mraise\u001b[39;00m \u001b[38;5;167;01mValueError\u001b[39;00m(\n\u001b[32m 1216\u001b[39m \u001b[33m\"\u001b[39m\u001b[33mConcatenating DataFrames from the transformer\u001b[39m\u001b[33m'\u001b[39m\u001b[33ms output lead to\u001b[39m\u001b[33m\"\u001b[39m\n\u001b[32m 1217\u001b[39m \u001b[33m\"\u001b[39m\u001b[33m an inconsistent number of samples. The output may have Pandas\u001b[39m\u001b[33m\"\u001b[39m\n\u001b[32m (...)\u001b[39m\u001b[32m 1220\u001b[39m \u001b[33m\"\u001b[39m\u001b[33m samples.\u001b[39m\u001b[33m\"\u001b[39m\n\u001b[32m 1221\u001b[39m )\n\u001b[32m 1223\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m output\n\u001b[32m-> \u001b[39m\u001b[32m1225\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[43mnp\u001b[49m\u001b[43m.\u001b[49m\u001b[43mhstack\u001b[49m\u001b[43m(\u001b[49m\u001b[43mXs\u001b[49m\u001b[43m)\u001b[49m\n",
"\u001b[36mFile \u001b[39m\u001b[32mc:\\Users\\FlorianSpeicher\\anaconda3\\Lib\\site-packages\\numpy\\_core\\shape_base.py:364\u001b[39m, in \u001b[36mhstack\u001b[39m\u001b[34m(tup, dtype, casting)\u001b[39m\n\u001b[32m 362\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m _nx.concatenate(arrs, \u001b[32m0\u001b[39m, dtype=dtype, casting=casting)\n\u001b[32m 363\u001b[39m \u001b[38;5;28;01melse\u001b[39;00m:\n\u001b[32m--> \u001b[39m\u001b[32m364\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[43m_nx\u001b[49m\u001b[43m.\u001b[49m\u001b[43mconcatenate\u001b[49m\u001b[43m(\u001b[49m\u001b[43marrs\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[32;43m1\u001b[39;49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mdtype\u001b[49m\u001b[43m=\u001b[49m\u001b[43mdtype\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mcasting\u001b[49m\u001b[43m=\u001b[49m\u001b[43mcasting\u001b[49m\u001b[43m)\u001b[49m\n",
"\u001b[31mValueError\u001b[39m: all the input array dimensions except for the concatenation axis must match exactly, but along dimension 0, the array at index 0 has size 1 and the array at index 3 has size 9999"
]
}
],
"source": [
"from sklearn.compose import make_column_transformer\n",
"from sklearn.feature_extraction.text import TfidfVectorizer\n",
"# appid,name,release_date,required_age,price,dlc_count,detailed_description,about_the_game,short_description,reviews,header_image,website,support_url,support_email,windows,mac,linux,metacritic_score,metacritic_url,achievements,recommendations,notes,supported_languages,full_audio_languages,packages,developers,publishers,categories,genres,screenshots,movies,user_score,score_rank,positive,negative,estimated_owners,average_playtime_forever,average_playtime_2weeks,median_playtime_forever,median_playtime_2weeks,discount,peak_ccu,tags,pct_pos_total,num_reviews_total,pct_pos_recent,num_reviews_recent\n",
"column_transformer = make_column_transformer(\n",
" (TfidfVectorizer(stop_words='english'), ['detailed_description']),\n",
" (TfidfVectorizer(stop_words='english'), ['about_the_game']),\n",
" (TfidfVectorizer(stop_words='english'), ['short_description']),\n",
" ('passthrough', ['required_age','price','dlc_count','reviews','windows','mac','linux','metacritic_score','achievements','recommendations','notes','supported_languages','full_audio_languages','categories','genres','user_score','score_rank','positive','negative','average_playtime_forever','average_playtime_2weeks','median_playtime_forever','median_playtime_2weeks','discount','peak_ccu','tags','pct_pos_total','num_reviews_total','pct_pos_recent','num_reviews_recent'])\n",
")\n",
"\n",
"dataset = column_transformer.fit_transform(dataset)\n",
"print(dataset.head())"
]
"vectorizer = TfidfVectorizer()\n",
"tfidf_matrix = vectorizer.fit_transform(dataset['desc']) # matrix, not pandas df\n",
"tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), columns=vectorizer.get_feature_names_out())\n",
"print(tfidf_df.head())"
],
"outputs": [],
"execution_count": null
},
{
"cell_type": "markdown",
"id": "ad84e777",
"metadata": {},
"source": [
"\n",
"### Removing Bundles\n",
"**(TODO: decide whether yes or no), not as important as i thought**\n",
"As bundles don't have clear genre(s) defined (e.g. publisher bundles )"
]
"source": "With this our feature matrix is completed"
},
{
"cell_type": "markdown",
"id": "6a2a3d4f",
"metadata": {},
"source": [
"### Handling missing values\n",
"Removing NaN values in the dataSet and setting missing numerical feature values to the mean feature count. Missing Text values are set to a default String `Unknown`."
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "dea7dc00",
"metadata": {},
"outputs": [],
"execution_count": null,
"source": [
"# Setting missing numeric values to the mean\n",
"dataset.fillna(dataset.mean(numeric_only=True), inplace=True)\n",
"# Setting missing text values to 'Unknown'\n",
"dataset.fillna('Unknown', inplace=True)\n",
"# Setting missing values in other columns to NaN\n",
"dataset.dropna(inplace=True)"
]
"X = tfidf_df\n",
"y = genres_df"
],
"id": "86d9da42f4df8e49"
},
{
"metadata": {},
"cell_type": "markdown",
"source": [
"## The Model\n",
"\n",
"#### Removing unpredicatble Datapoints\n",
"Some Datapoints don't have a genre assigned (all feature values in y are 0). The model we use can't handle such cases, thus they have to be removed.\n",
"We filter after all values that we can use with a mask, and apply that mask to our matrices."
],
"id": "aeb782668f311cd8"
},
{
"metadata": {},
"cell_type": "code",
"outputs": [],
"execution_count": null,
"source": [
"mask = y.sum(axis=1).map(lambda x: x > 0)\n",
"print((mask == False).sum()) # count of unpredictable datapoints\n",
"\n",
"X_clean = X[mask]\n",
"y_clean = y[mask]"
],
"id": "4919bf1b37d171a7"
},
{
"cell_type": "markdown",
"id": "091d7e13",
"metadata": {},
"source": [
"# Data Split\n",
"Splitting our dataSet to training and testing data. The relation is 80% training and 20% testing data."
"# Splitting up data\n",
"We have to split up our data into training and testing data.\n",
"Using random_state=0 guarantees reproducability."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "cfbf3787",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Trainingsdaten: (7999, 33), Testdaten: (2000, 33)\n"
]
"metadata": {
"jupyter": {
"is_executing": true
}
],
},
"source": [
"from sklearn.model_selection import train_test_split\n",
"\n",
"# Setting the target feature 'genres' and dropping it from the dataset\n",
"X = dataset.drop('genres', axis=1)\n",
"y = dataset['genres']\n",
"\n",
"X_train, X_test, y_train, y_test = train_test_split(\n",
" X, y, test_size=0.2, random_state=42\n",
")\n",
"\n",
"print(f\"Training: {X_train.shape}, Testing: {X_test.shape}\")"
]
"X_train, X_test, y_train, y_test = train_test_split(X_clean, y_clean, random_state=0)"
],
"outputs": [],
"execution_count": null
},
{
"cell_type": "markdown",
@@ -335,17 +289,25 @@
"metadata": {},
"source": [
"# Model Selection\n",
"**TODO Deciding which model to use for this task**"
"**TODO Deciding which model to use for this task**\n",
"\n",
"As a game can have multiple genres, our Model(s) has to be capable of multi-label-classification. sklearn's ``MultiOutputClassifier`` can do this. As a backend for ``MultiOutputClassifier`` we use ``LogisticRegression``"
]
},
{
"cell_type": "markdown",
"id": "b7795aa1",
"metadata": {},
"cell_type": "code",
"outputs": [],
"execution_count": null,
"source": [
"### Training\n",
"**TODO Train the Selected Model with the training data**"
]
"# n_jobs=1 since there seems to be some multithreading join issue in sklearn (or my pc is too bad)\n",
"multi_target_clf = MultiOutputClassifier(LogisticRegression(max_iter=1337, random_state=0), n_jobs=1)\n",
"\n",
"multi_target_clf.fit(X_train, y_train)\n",
"\n",
"y_pred = multi_target_clf.predict(X_test)"
],
"id": "8c1d72c4532bd509"
},
{
"cell_type": "markdown",
@@ -356,6 +318,14 @@
"**TODO Test the Model with the test data**"
]
},
{
"metadata": {},
"cell_type": "code",
"outputs": [],
"execution_count": null,
"source": "print(classification_report(y_test, y_pred, zero_division=0.0))",
"id": "e2ebea6945193e07"
},
{
"cell_type": "markdown",
"id": "2aeb6fc2",
@@ -386,7 +356,7 @@
],
"metadata": {
"kernelspec": {
"display_name": "base",
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
@@ -400,7 +370,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.13.5"
"version": "3.13.3"
}
},
"nbformat": 4,