{
"cells": [
{
"cell_type": "markdown",
"id": "a3a7634f",
"metadata": {},
"source": [
"# Machine Learning project in the summer semester (SoSe) 2025 at HTW Saar\n",
"## Idea\n",
"The goal of this project is to predict the genre(s) of a game from its metadata.\n",
"\n",
"## Dataset\n",
"For our project we use a Steam dataset from Kaggle. You can find it at the following URL: [Kaggle.com](https://www.kaggle.com/datasets/artermiloff/steam-games-dataset/data)\n",
"\n",
"### Importing the dataset\n",
"The dataset is loaded from a CSV file and stored in the variable `dataset`."
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "3116b75f",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" appid name release_date required_age price \\\n",
"0 730 Counter-Strike 2 2012-08-21 0 0.00 \n",
"1 578080 PUBG: BATTLEGROUNDS 2017-12-21 0 0.00 \n",
"2 570 Dota 2 2013-07-09 0 0.00 \n",
"3 271590 Grand Theft Auto V Legacy 2015-04-13 17 0.00 \n",
"4 359550 Tom Clancy's Rainbow Six® Siege 2015-12-01 17 3.99 \n",
"\n",
" dlc_count detailed_description \\\n",
"0 1 For over two decades, Counter-Strike has offer... \n",
"1 0 LAND, LOOT, SURVIVE! Play PUBG: BATTLEGROUNDS ... \n",
"2 2 The most-played game on Steam. Every day, mill... \n",
"3 0 When a young street hustler, a retired bank ro... \n",
"4 9 Edition Comparison Ultimate Edition The Tom Cl... \n",
"\n",
" about_the_game \\\n",
"0 For over two decades, Counter-Strike has offer... \n",
"1 LAND, LOOT, SURVIVE! Play PUBG: BATTLEGROUNDS ... \n",
"2 The most-played game on Steam. Every day, mill... \n",
"3 When a young street hustler, a retired bank ro... \n",
"4 “One of the best first-person shooters ever ma... \n",
"\n",
" short_description \\\n",
"0 For over two decades, Counter-Strike has offer... \n",
"1 Play PUBG: BATTLEGROUNDS for free. Land on str... \n",
"2 Every day, millions of players worldwide enter... \n",
"3 Grand Theft Auto V for PC offers players the o... \n",
"4 Tom Clancy's Rainbow Six® Siege is an elite, t... \n",
"\n",
" reviews ... \\\n",
"0 NaN ... \n",
"1 NaN ... \n",
"2 “A modern multiplayer masterpiece.” 9.5/10 – D... ... \n",
"3 NaN ... \n",
"4 NaN ... \n",
"\n",
" average_playtime_2weeks median_playtime_forever median_playtime_2weeks \\\n",
"0 879 5174 350 \n",
"1 0 0 0 \n",
"2 1536 898 892 \n",
"3 771 7101 74 \n",
"4 682 2434 306 \n",
"\n",
" discount peak_ccu tags \\\n",
"0 0 1212356 {'FPS': 90857, 'Shooter': 65397, 'Multiplayer'... \n",
"1 0 616738 {'Survival': 14838, 'Shooter': 12727, 'Battle ... \n",
"2 0 555977 {'Free to Play': 59933, 'MOBA': 20158, 'Multip... \n",
"3 0 117698 {'Open World': 32644, 'Action': 23539, 'Multip... \n",
"4 80 89916 {'FPS': 9831, 'PvP': 9162, 'e-sports': 9072, '... \n",
"\n",
" pct_pos_total num_reviews_total pct_pos_recent num_reviews_recent \n",
"0 86 8632939 82 96473 \n",
"1 59 2513842 68 16720 \n",
"2 81 2452595 80 29366 \n",
"3 87 1803832 92 17517 \n",
"4 84 1168020 76 12608 \n",
"\n",
"[5 rows x 47 columns]\n"
]
}
],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"# load data\n",
"# appid,name,release_date,required_age,price,dlc_count,detailed_description,about_the_game,short_description,reviews,header_image,website,support_url,support_email,windows,mac,linux,metacritic_score,metacritic_url,achievements,recommendations,notes,supported_languages,full_audio_languages,packages,developers,publishers,categories,genres,screenshots,movies,user_score,score_rank,positive,negative,estimated_owners,average_playtime_forever,average_playtime_2weeks,median_playtime_forever,median_playtime_2weeks,discount,peak_ccu,tags,pct_pos_total,num_reviews_total,pct_pos_recent,num_reviews_recent\n",
"dataset = pd.read_csv(\"./games_march2025_cleaned_10k.csv\",sep=\",\")\n",
"print(dataset.head())"
]
},
{
"cell_type": "markdown",
"id": "cba9750a",
"metadata": {},
"source": [
"## Preparation of the training set\n",
"### Removing unique features\n",
"We remove the following features from the training set because they can uniquely identify a data point:\n",
"- App ID\n",
"- Name of the game\n",
"- Release date\n",
"- Reviews\n",
"- Header image\n",
"- Website\n",
"- Support URL\n",
"- Support email\n",
"- Metacritic URL\n",
"- Developers\n",
"- Publishers\n",
"- Screenshots\n",
"- Movies\n",
"- Estimated owners"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "06dedcdf",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" required_age price dlc_count \\\n",
"0 0 0.00 1 \n",
"1 0 0.00 0 \n",
"2 0 0.00 2 \n",
"3 17 0.00 0 \n",
"4 17 3.99 9 \n",
"\n",
" detailed_description \\\n",
"0 For over two decades, Counter-Strike has offer... \n",
"1 LAND, LOOT, SURVIVE! Play PUBG: BATTLEGROUNDS ... \n",
"2 The most-played game on Steam. Every day, mill... \n",
"3 When a young street hustler, a retired bank ro... \n",
"4 Edition Comparison Ultimate Edition The Tom Cl... \n",
"\n",
" about_the_game \\\n",
"0 For over two decades, Counter-Strike has offer... \n",
"1 LAND, LOOT, SURVIVE! Play PUBG: BATTLEGROUNDS ... \n",
"2 The most-played game on Steam. Every day, mill... \n",
"3 When a young street hustler, a retired bank ro... \n",
"4 “One of the best first-person shooters ever ma... \n",
"\n",
" short_description \\\n",
"0 For over two decades, Counter-Strike has offer... \n",
"1 Play PUBG: BATTLEGROUNDS for free. Land on str... \n",
"2 Every day, millions of players worldwide enter... \n",
"3 Grand Theft Auto V for PC offers players the o... \n",
"4 Tom Clancy's Rainbow Six® Siege is an elite, t... \n",
"\n",
" reviews windows mac linux \\\n",
"0 NaN True False True \n",
"1 NaN True False False \n",
"2 “A modern multiplayer masterpiece.” 9.5/10 – D... True True True \n",
"3 NaN True False False \n",
"4 NaN True False False \n",
"\n",
" ... average_playtime_2weeks median_playtime_forever \\\n",
"0 ... 879 5174 \n",
"1 ... 0 0 \n",
"2 ... 1536 898 \n",
"3 ... 771 7101 \n",
"4 ... 682 2434 \n",
"\n",
" median_playtime_2weeks discount peak_ccu \\\n",
"0 350 0 1212356 \n",
"1 0 0 616738 \n",
"2 892 0 555977 \n",
"3 74 0 117698 \n",
"4 306 80 89916 \n",
"\n",
" tags pct_pos_total \\\n",
"0 {'FPS': 90857, 'Shooter': 65397, 'Multiplayer'... 86 \n",
"1 {'Survival': 14838, 'Shooter': 12727, 'Battle ... 59 \n",
"2 {'Free to Play': 59933, 'MOBA': 20158, 'Multip... 81 \n",
"3 {'Open World': 32644, 'Action': 23539, 'Multip... 87 \n",
"4 {'FPS': 9831, 'PvP': 9162, 'e-sports': 9072, '... 84 \n",
"\n",
" num_reviews_total pct_pos_recent num_reviews_recent \n",
"0 8632939 82 96473 \n",
"1 2513842 68 16720 \n",
"2 2452595 80 29366 \n",
"3 1803832 92 17517 \n",
"4 1168020 76 12608 \n",
"\n",
"[5 rows x 34 columns]\n"
]
}
],
"source": [
"# appid,name,release_date,required_age,price,dlc_count,detailed_description,about_the_game,short_description,reviews,header_image,website,support_url,support_email,windows,mac,linux,metacritic_score,metacritic_url,achievements,recommendations,notes,supported_languages,full_audio_languages,packages,developers,publishers,categories,genres,screenshots,movies,user_score,score_rank,positive,negative,estimated_owners,average_playtime_forever,average_playtime_2weeks,median_playtime_forever,median_playtime_2weeks,discount,peak_ccu,tags,pct_pos_total,num_reviews_total,pct_pos_recent,num_reviews_recent\n",
"dataset.drop(['appid', 'name', 'release_date', 'reviews', 'header_image', 'website', 'support_url', 'support_email',\n",
" 'metacritic_url', 'developers', 'publishers', 'screenshots', 'movies', 'estimated_owners'],\n",
" axis=1, inplace=True)\n",
"print(dataset.head())"
]
},
{
"cell_type": "markdown",
"id": "f5436c87",
"metadata": {},
"source": [
"### Structuring text\n",
"**TODO: check whether this makes sense**\n",
"The dataset contains a lot of unstructured text, so we use Term Frequency-Inverse Document Frequency (TF-IDF) to turn most text features into numerical vectors.\n",
"It is important to use a new `TfidfVectorizer` instance for each feature so that their vocabularies do not overlap.\n",
"\n",
"### Standardizing values\n",
"We only preprocess the text features, by removing English stop words. The dataset already provides standardized numerical features."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4e8b407c",
"metadata": {},
"outputs": [],
"source": [
"from sklearn.compose import make_column_transformer\n",
"from sklearn.feature_extraction.text import TfidfVectorizer\n",
"\n",
"text_columns = ['detailed_description', 'about_the_game', 'short_description']\n",
"# Pass through only numeric and boolean columns; free-text columns such as 'tags' and the target 'genres' are left out here\n",
"numeric_columns = dataset.select_dtypes(include=['number', 'bool']).columns.tolist()\n",
"\n",
"# TfidfVectorizer expects a 1-D sequence of strings, so each text column is selected by name (a string);\n",
"# selecting it with a one-element list passes a 2-D DataFrame and caused the ValueError about mismatched array sizes\n",
"column_transformer = make_column_transformer(\n",
"    (TfidfVectorizer(stop_words='english'), 'detailed_description'),\n",
"    (TfidfVectorizer(stop_words='english'), 'about_the_game'),\n",
"    (TfidfVectorizer(stop_words='english'), 'short_description'),\n",
"    ('passthrough', numeric_columns)\n",
")\n",
"\n",
"# TfidfVectorizer cannot handle NaN, so missing texts become empty strings; numeric columns are cast to float\n",
"prepared = dataset[text_columns].fillna('').join(dataset[numeric_columns].astype(float))\n",
"# The result is a feature matrix, not a DataFrame, so it is stored separately and `dataset` is not overwritten\n",
"features = column_transformer.fit_transform(prepared)\n",
"print(features.shape)"
]
},
{
"cell_type": "markdown",
"id": "ad84e777",
"metadata": {},
"source": [
"\n",
"### Removing Bundles\n",
"**TODO: decide whether to do this; it is not as important as initially thought.**\n",
"Bundles (e.g. publisher bundles) do not have clearly defined genre(s), so they are candidates for removal."
]
},
{
"cell_type": "markdown",
"id": "6a2a3d4f",
"metadata": {},
"source": [
"### Handling missing values\n",
"Missing numerical values are replaced with the mean of the respective feature, and missing text values are set to the default string `Unknown`. Any rows that still contain NaN values afterwards are dropped."
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "dea7dc00",
"metadata": {},
"outputs": [],
"source": [
"# Setting missing numeric values to the mean\n",
"dataset.fillna(dataset.mean(numeric_only=True), inplace=True)\n",
"# Setting missing text values to 'Unknown'\n",
"dataset.fillna('Unknown', inplace=True)\n",
"# Dropping any rows that still contain missing values\n",
"dataset.dropna(inplace=True)"
]
},
{
"cell_type": "markdown",
"id": "091d7e13",
"metadata": {},
"source": [
"# Data Split\n",
"We split the dataset into training and test data, using 80% for training and 20% for testing."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "cfbf3787",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Training: (7999, 33), Testing: (2000, 33)\n"
]
}
],
"source": [
"from sklearn.model_selection import train_test_split\n",
"\n",
"# Setting the target feature 'genres' and dropping it from the dataset\n",
"X = dataset.drop('genres', axis=1)\n",
"y = dataset['genres']\n",
"\n",
"X_train, X_test, y_train, y_test = train_test_split(\n",
" X, y, test_size=0.2, random_state=42\n",
")\n",
"\n",
"print(f\"Training: {X_train.shape}, Testing: {X_test.shape}\")"
]
},
{
"cell_type": "markdown",
"id": "12b5283d",
"metadata": {},
"source": [
"# Model Selection\n",
"**TODO: Decide which model to use for this task**"
]
},
{
"cell_type": "markdown",
"id": "b7795aa1",
"metadata": {},
"source": [
"### Training\n",
"**TODO: Train the selected model with the training data**"
]
},
{
"cell_type": "markdown",
"id": "0faa9856",
"metadata": {},
"source": [
"# Evaluation\n",
"**TODO: Test the model with the test data**"
]
},
{
"cell_type": "markdown",
"id": "2aeb6fc2",
"metadata": {},
"source": [
"# Optimization\n",
"**TODO: Optimize the model based on the test results**"
]
},
{
"cell_type": "markdown",
"id": "79b20645",
"metadata": {},
"source": [
"# Validation\n",
"**TODO: Predict actual values**"
]
},
{
"cell_type": "markdown",
"id": "3b709fb7",
"metadata": {},
"source": [
"# Conclusion and outlook\n",
"**TODO: Write a conclusion and an outlook covering what can still be done and where the issues were.**"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "base",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.13.5"
}
},
"nbformat": 4,
"nbformat_minor": 5
}