machine-learning/notebook.ipynb
2025-08-11 23:45:16 +02:00

{
"cells": [
{
"cell_type": "markdown",
"id": "a3a7634f",
"metadata": {},
"source": [
"# Machine Learning project in summer semester (SoSe) 2025 at HTW Saar\n",
"## Idea\n",
"The goal of this project is to predict the genre(s) of a game from its metadata.\n",
"\n",
"## Dataset\n",
"For our project we use a Steam dataset from Kaggle. You can find it at the following URL: [Kaggle.com](https://www.kaggle.com/datasets/artermiloff/steam-games-dataset/data)\n",
"\n",
"### Importing the dataset\n",
"The dataset is loaded from the CSV file into a pandas DataFrame named `dataset`."
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "3116b75f",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" appid name release_date required_age price \\\n",
"0 730 Counter-Strike 2 2012-08-21 0 0.00 \n",
"1 578080 PUBG: BATTLEGROUNDS 2017-12-21 0 0.00 \n",
"2 570 Dota 2 2013-07-09 0 0.00 \n",
"3 271590 Grand Theft Auto V Legacy 2015-04-13 17 0.00 \n",
"4 359550 Tom Clancy's Rainbow Six® Siege 2015-12-01 17 3.99 \n",
"\n",
" dlc_count detailed_description \\\n",
"0 1 For over two decades, Counter-Strike has offer... \n",
"1 0 LAND, LOOT, SURVIVE! Play PUBG: BATTLEGROUNDS ... \n",
"2 2 The most-played game on Steam. Every day, mill... \n",
"3 0 When a young street hustler, a retired bank ro... \n",
"4 9 Edition Comparison Ultimate Edition The Tom Cl... \n",
"\n",
" about_the_game \\\n",
"0 For over two decades, Counter-Strike has offer... \n",
"1 LAND, LOOT, SURVIVE! Play PUBG: BATTLEGROUNDS ... \n",
"2 The most-played game on Steam. Every day, mill... \n",
"3 When a young street hustler, a retired bank ro... \n",
"4 “One of the best first-person shooters ever ma... \n",
"\n",
" short_description \\\n",
"0 For over two decades, Counter-Strike has offer... \n",
"1 Play PUBG: BATTLEGROUNDS for free. Land on str... \n",
"2 Every day, millions of players worldwide enter... \n",
"3 Grand Theft Auto V for PC offers players the o... \n",
"4 Tom Clancy's Rainbow Six® Siege is an elite, t... \n",
"\n",
" reviews ... \\\n",
"0 NaN ... \n",
"1 NaN ... \n",
"2 “A modern multiplayer masterpiece.” 9.5/10 D... ... \n",
"3 NaN ... \n",
"4 NaN ... \n",
"\n",
" average_playtime_2weeks median_playtime_forever median_playtime_2weeks \\\n",
"0 879 5174 350 \n",
"1 0 0 0 \n",
"2 1536 898 892 \n",
"3 771 7101 74 \n",
"4 682 2434 306 \n",
"\n",
" discount peak_ccu tags \\\n",
"0 0 1212356 {'FPS': 90857, 'Shooter': 65397, 'Multiplayer'... \n",
"1 0 616738 {'Survival': 14838, 'Shooter': 12727, 'Battle ... \n",
"2 0 555977 {'Free to Play': 59933, 'MOBA': 20158, 'Multip... \n",
"3 0 117698 {'Open World': 32644, 'Action': 23539, 'Multip... \n",
"4 80 89916 {'FPS': 9831, 'PvP': 9162, 'e-sports': 9072, '... \n",
"\n",
" pct_pos_total num_reviews_total pct_pos_recent num_reviews_recent \n",
"0 86 8632939 82 96473 \n",
"1 59 2513842 68 16720 \n",
"2 81 2452595 80 29366 \n",
"3 87 1803832 92 17517 \n",
"4 84 1168020 76 12608 \n",
"\n",
"[5 rows x 47 columns]\n"
]
}
],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"# load data\n",
"# appid,name,release_date,required_age,price,dlc_count,detailed_description,about_the_game,short_description,reviews,header_image,website,support_url,support_email,windows,mac,linux,metacritic_score,metacritic_url,achievements,recommendations,notes,supported_languages,full_audio_languages,packages,developers,publishers,categories,genres,screenshots,movies,user_score,score_rank,positive,negative,estimated_owners,average_playtime_forever,average_playtime_2weeks,median_playtime_forever,median_playtime_2weeks,discount,peak_ccu,tags,pct_pos_total,num_reviews_total,pct_pos_recent,num_reviews_recent\n",
"dataset = pd.read_csv(\"./games_march2025_cleaned_10k.csv\",sep=\",\")\n",
"print(dataset.head())"
]
},
{
"cell_type": "markdown",
"id": "cba9750a",
"metadata": {},
"source": [
"## Preparation of the Training Set\n",
"### Removing Uniques\n",
"We remove the following features from the training set because they can uniquely identify a data point:\n",
"- AppId\n",
"- Name of the Game\n",
"- Release Date\n",
"- Reviews\n",
"- Header Image\n",
"- Website\n",
"- Support URL\n",
"- Support Email\n",
"- MetaCritic URL\n",
"- Developer\n",
"- Publisher\n",
"- Screenshots\n",
"- Movies\n",
"- Estimated Owners"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "06dedcdf",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" required_age price dlc_count \\\n",
"0 0 0.00 1 \n",
"1 0 0.00 0 \n",
"2 0 0.00 2 \n",
"3 17 0.00 0 \n",
"4 17 3.99 9 \n",
"\n",
" detailed_description \\\n",
"0 For over two decades, Counter-Strike has offer... \n",
"1 LAND, LOOT, SURVIVE! Play PUBG: BATTLEGROUNDS ... \n",
"2 The most-played game on Steam. Every day, mill... \n",
"3 When a young street hustler, a retired bank ro... \n",
"4 Edition Comparison Ultimate Edition The Tom Cl... \n",
"\n",
" about_the_game \\\n",
"0 For over two decades, Counter-Strike has offer... \n",
"1 LAND, LOOT, SURVIVE! Play PUBG: BATTLEGROUNDS ... \n",
"2 The most-played game on Steam. Every day, mill... \n",
"3 When a young street hustler, a retired bank ro... \n",
"4 “One of the best first-person shooters ever ma... \n",
"\n",
" short_description \\\n",
"0 For over two decades, Counter-Strike has offer... \n",
"1 Play PUBG: BATTLEGROUNDS for free. Land on str... \n",
"2 Every day, millions of players worldwide enter... \n",
"3 Grand Theft Auto V for PC offers players the o... \n",
"4 Tom Clancy's Rainbow Six® Siege is an elite, t... \n",
"\n",
" reviews windows mac linux \\\n",
"0 NaN True False True \n",
"1 NaN True False False \n",
"2 “A modern multiplayer masterpiece.” 9.5/10 D... True True True \n",
"3 NaN True False False \n",
"4 NaN True False False \n",
"\n",
" ... average_playtime_2weeks median_playtime_forever \\\n",
"0 ... 879 5174 \n",
"1 ... 0 0 \n",
"2 ... 1536 898 \n",
"3 ... 771 7101 \n",
"4 ... 682 2434 \n",
"\n",
" median_playtime_2weeks discount peak_ccu \\\n",
"0 350 0 1212356 \n",
"1 0 0 616738 \n",
"2 892 0 555977 \n",
"3 74 0 117698 \n",
"4 306 80 89916 \n",
"\n",
" tags pct_pos_total \\\n",
"0 {'FPS': 90857, 'Shooter': 65397, 'Multiplayer'... 86 \n",
"1 {'Survival': 14838, 'Shooter': 12727, 'Battle ... 59 \n",
"2 {'Free to Play': 59933, 'MOBA': 20158, 'Multip... 81 \n",
"3 {'Open World': 32644, 'Action': 23539, 'Multip... 87 \n",
"4 {'FPS': 9831, 'PvP': 9162, 'e-sports': 9072, '... 84 \n",
"\n",
" num_reviews_total pct_pos_recent num_reviews_recent \n",
"0 8632939 82 96473 \n",
"1 2513842 68 16720 \n",
"2 2452595 80 29366 \n",
"3 1803832 92 17517 \n",
"4 1168020 76 12608 \n",
"\n",
"[5 rows x 34 columns]\n"
]
}
],
"source": [
"# appid,name,release_date,required_age,price,dlc_count,detailed_description,about_the_game,short_description,reviews,header_image,website,support_url,support_email,windows,mac,linux,metacritic_score,metacritic_url,achievements,recommendations,notes,supported_languages,full_audio_languages,packages,developers,publishers,categories,genres,screenshots,movies,user_score,score_rank,positive,negative,estimated_owners,average_playtime_forever,average_playtime_2weeks,median_playtime_forever,median_playtime_2weeks,discount,peak_ccu,tags,pct_pos_total,num_reviews_total,pct_pos_recent,num_reviews_recent\n",
"dataset.drop(['appid', 'name', 'release_date', 'reviews', 'header_image', 'website', 'support_url', 'support_email',\n",
" 'metacritic_url', 'developers', 'publishers', 'screenshots', 'movies', 'estimated_owners'],\n",
" axis=1, inplace=True)\n",
"print(dataset.head())"
]
},
{
"cell_type": "markdown",
"id": "f5436c87",
"metadata": {},
"source": [
"### Structuring Text\n",
"**TODO: check whether this makes sense**\n",
"The dataset holds a lot of unstructured text. We use Term Frequency-Inverse Document Frequency (TF-IDF) to turn most text features into numerical vectors.\n",
"It is important to use a new vectorizer instance for each feature so that their vocabularies don't overlap with each other.\n",
"\n",
"### Standardize Values\n",
"We only normalize the text features, by removing English stop words. The dataset already provides standardized numerical features."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4e8b407c",
"metadata": {},
"outputs": [],
"source": [
"from sklearn.compose import make_column_selector, make_column_transformer\n",
"from sklearn.feature_extraction.text import TfidfVectorizer\n",
"\n",
"text_columns = ['detailed_description', 'about_the_game', 'short_description']\n",
"# TfidfVectorizer cannot handle NaN, so missing descriptions become empty strings\n",
"dataset[text_columns] = dataset[text_columns].fillna('')\n",
"\n",
"# A text vectorizer needs a 1-D column, so each selector is a plain string.\n",
"# A list selector passes a 2-D frame, which raised the ValueError about mismatched array sizes.\n",
"column_transformer = make_column_transformer(\n",
"    (TfidfVectorizer(stop_words='english'), 'detailed_description'),\n",
"    (TfidfVectorizer(stop_words='english'), 'about_the_game'),\n",
"    (TfidfVectorizer(stop_words='english'), 'short_description'),\n",
"    # Only numeric and boolean columns can be stacked next to the sparse TF-IDF output\n",
"    ('passthrough', make_column_selector(dtype_include=['number', 'bool']))\n",
")\n",
"\n",
"# The result is a matrix, not a DataFrame; `dataset` stays intact for the later cells\n",
"features = column_transformer.fit_transform(dataset)\n",
"print(features.shape)"
]
},
{
"cell_type": "markdown",
"id": "ad84e777",
"metadata": {},
"source": [
"\n",
"### Removing Bundles\n",
"**(TODO: decide whether to do this; it is less important than initially thought)**\n",
"Bundles (e.g. publisher bundles) don't have clearly defined genre(s), so they may need to be removed."
]
},
{
"cell_type": "markdown",
"id": "6a2a3d4f",
"metadata": {},
"source": [
"### Handling missing values\n",
"Missing numerical values are replaced with the mean of the respective feature. Missing text values are set to the default string `Unknown`. Any rows that still contain NaN afterwards are dropped."
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "dea7dc00",
"metadata": {},
"outputs": [],
"source": [
"# Setting missing numeric values to the mean of each column\n",
"dataset.fillna(dataset.mean(numeric_only=True), inplace=True)\n",
"# Setting missing text values to 'Unknown'\n",
"dataset.fillna('Unknown', inplace=True)\n",
"# Dropping any rows that still contain NaN\n",
"dataset.dropna(inplace=True)"
]
},
{
"cell_type": "markdown",
"id": "091d7e13",
"metadata": {},
"source": [
"# Data Split\n",
"We split our dataset into training and test data, using 80% for training and 20% for testing."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "cfbf3787",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Training: (7999, 33), Testing: (2000, 33)\n"
]
}
],
"source": [
"from sklearn.model_selection import train_test_split\n",
"\n",
"# Setting the target feature 'genres' and dropping it from the dataset\n",
"X = dataset.drop('genres', axis=1)\n",
"y = dataset['genres']\n",
"\n",
"X_train, X_test, y_train, y_test = train_test_split(\n",
" X, y, test_size=0.2, random_state=42\n",
")\n",
"\n",
"print(f\"Training: {X_train.shape}, Testing: {X_test.shape}\")"
]
},
{
"cell_type": "markdown",
"id": "12b5283d",
"metadata": {},
"source": [
"# Model Selection\n",
"**TODO: Decide which model to use for this task**"
]
},
{
"cell_type": "markdown",
"id": "b7795aa1",
"metadata": {},
"source": [
"### Training\n",
"**TODO: Train the selected model with the training data**"
]
},
{
"cell_type": "markdown",
"id": "0faa9856",
"metadata": {},
"source": [
"# Evaluation\n",
"**TODO: Test the model with the test data**"
]
},
{
"cell_type": "markdown",
"id": "2aeb6fc2",
"metadata": {},
"source": [
"# Optimization\n",
"**TODO: Optimize the model based on the test results**"
]
},
{
"cell_type": "markdown",
"id": "79b20645",
"metadata": {},
"source": [
"# Validation\n",
"**TODO: Predict actual values**"
]
},
{
"cell_type": "markdown",
"id": "3b709fb7",
"metadata": {},
"source": [
"# Conclusion and outlook\n",
"**TODO: Write a conclusion and an outlook covering what can still be done and where the issues were.**"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "base",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.13.5"
}
},
"nbformat": 4,
"nbformat_minor": 5
}