machine-learning/notebook.ipynb
2025-08-11 22:26:46 +02:00

{
"cells": [
{
"cell_type": "markdown",
"id": "a3a7634f",
"metadata": {},
"source": [
"# Machine Learning Project in Summer Semester (SoSe) 2025 at HTW Saar\n",
"## Idea\n",
"The goal of this project is to predict the genre(s) of a game from its metadata.\n",
"\n",
"## Dataset\n",
"For our project we use a Steam dataset from Kaggle. You can find it at the following URL: [Kaggle.com](https://www.kaggle.com/datasets/artermiloff/steam-games-dataset/data)"
]
},
{
"cell_type": "code",
"execution_count": 16,
"id": "3116b75f",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" appid name release_date required_age price \\\n",
"0 730 Counter-Strike 2 2012-08-21 0 0.00 \n",
"1 578080 PUBG: BATTLEGROUNDS 2017-12-21 0 0.00 \n",
"2 570 Dota 2 2013-07-09 0 0.00 \n",
"3 271590 Grand Theft Auto V Legacy 2015-04-13 17 0.00 \n",
"4 359550 Tom Clancy's Rainbow Six® Siege 2015-12-01 17 3.99 \n",
"\n",
" dlc_count detailed_description \\\n",
"0 1 For over two decades, Counter-Strike has offer... \n",
"1 0 LAND, LOOT, SURVIVE! Play PUBG: BATTLEGROUNDS ... \n",
"2 2 The most-played game on Steam. Every day, mill... \n",
"3 0 When a young street hustler, a retired bank ro... \n",
"4 9 Edition Comparison Ultimate Edition The Tom Cl... \n",
"\n",
" about_the_game \\\n",
"0 For over two decades, Counter-Strike has offer... \n",
"1 LAND, LOOT, SURVIVE! Play PUBG: BATTLEGROUNDS ... \n",
"2 The most-played game on Steam. Every day, mill... \n",
"3 When a young street hustler, a retired bank ro... \n",
"4 “One of the best first-person shooters ever ma... \n",
"\n",
" short_description \\\n",
"0 For over two decades, Counter-Strike has offer... \n",
"1 Play PUBG: BATTLEGROUNDS for free. Land on str... \n",
"2 Every day, millions of players worldwide enter... \n",
"3 Grand Theft Auto V for PC offers players the o... \n",
"4 Tom Clancy's Rainbow Six® Siege is an elite, t... \n",
"\n",
" reviews ... \\\n",
"0 NaN ... \n",
"1 NaN ... \n",
"2 “A modern multiplayer masterpiece.” 9.5/10 D... ... \n",
"3 NaN ... \n",
"4 NaN ... \n",
"\n",
" average_playtime_2weeks median_playtime_forever median_playtime_2weeks \\\n",
"0 879 5174 350 \n",
"1 0 0 0 \n",
"2 1536 898 892 \n",
"3 771 7101 74 \n",
"4 682 2434 306 \n",
"\n",
" discount peak_ccu tags \\\n",
"0 0 1212356 {'FPS': 90857, 'Shooter': 65397, 'Multiplayer'... \n",
"1 0 616738 {'Survival': 14838, 'Shooter': 12727, 'Battle ... \n",
"2 0 555977 {'Free to Play': 59933, 'MOBA': 20158, 'Multip... \n",
"3 0 117698 {'Open World': 32644, 'Action': 23539, 'Multip... \n",
"4 80 89916 {'FPS': 9831, 'PvP': 9162, 'e-sports': 9072, '... \n",
"\n",
" pct_pos_total num_reviews_total pct_pos_recent num_reviews_recent \n",
"0 86 8632939 82 96473 \n",
"1 59 2513842 68 16720 \n",
"2 81 2452595 80 29366 \n",
"3 87 1803832 92 17517 \n",
"4 84 1168020 76 12608 \n",
"\n",
"[5 rows x 47 columns]\n"
]
}
],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"# load data\n",
"# appid,name,release_date,required_age,price,dlc_count,detailed_description,about_the_game,short_description,reviews,header_image,website,support_url,support_email,windows,mac,linux,metacritic_score,metacritic_url,achievements,recommendations,notes,supported_languages,full_audio_languages,packages,developers,publishers,categories,genres,screenshots,movies,user_score,score_rank,positive,negative,estimated_owners,average_playtime_forever,average_playtime_2weeks,median_playtime_forever,median_playtime_2weeks,discount,peak_ccu,tags,pct_pos_total,num_reviews_total,pct_pos_recent,num_reviews_recent\n",
"dataset = pd.read_csv(\"./games_march2025_cleaned.csv\",sep=\",\")\n",
"print(dataset.head())"
]
},
{
"cell_type": "markdown",
"id": "cba9750a",
"metadata": {},
"source": [
"## Preparation of the Training Set\n",
"### Removing Uniques\n",
"We remove the following features from the training set because they can (nearly) uniquely identify a data point:\n",
"- AppId\n",
"- Name of the Game\n",
"- Release Date\n",
"- Header Image\n",
"- Website\n",
"- Support URL\n",
"- Support Email\n",
"- MetaCritic URL\n",
"- Developer\n",
"- Publisher\n",
"- Screenshots\n",
"- Movies\n",
"- Estimated Owners"
]
},
{
"cell_type": "code",
"execution_count": 17,
"id": "06dedcdf",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" required_age price dlc_count \\\n",
"0 0 0.00 1 \n",
"1 0 0.00 0 \n",
"2 0 0.00 2 \n",
"3 17 0.00 0 \n",
"4 17 3.99 9 \n",
"\n",
" detailed_description \\\n",
"0 For over two decades, Counter-Strike has offer... \n",
"1 LAND, LOOT, SURVIVE! Play PUBG: BATTLEGROUNDS ... \n",
"2 The most-played game on Steam. Every day, mill... \n",
"3 When a young street hustler, a retired bank ro... \n",
"4 Edition Comparison Ultimate Edition The Tom Cl... \n",
"\n",
" about_the_game \\\n",
"0 For over two decades, Counter-Strike has offer... \n",
"1 LAND, LOOT, SURVIVE! Play PUBG: BATTLEGROUNDS ... \n",
"2 The most-played game on Steam. Every day, mill... \n",
"3 When a young street hustler, a retired bank ro... \n",
"4 “One of the best first-person shooters ever ma... \n",
"\n",
" short_description \\\n",
"0 For over two decades, Counter-Strike has offer... \n",
"1 Play PUBG: BATTLEGROUNDS for free. Land on str... \n",
"2 Every day, millions of players worldwide enter... \n",
"3 Grand Theft Auto V for PC offers players the o... \n",
"4 Tom Clancy's Rainbow Six® Siege is an elite, t... \n",
"\n",
" reviews windows mac linux \\\n",
"0 NaN True False True \n",
"1 NaN True False False \n",
"2 “A modern multiplayer masterpiece.” 9.5/10 D... True True True \n",
"3 NaN True False False \n",
"4 NaN True False False \n",
"\n",
" ... average_playtime_2weeks median_playtime_forever \\\n",
"0 ... 879 5174 \n",
"1 ... 0 0 \n",
"2 ... 1536 898 \n",
"3 ... 771 7101 \n",
"4 ... 682 2434 \n",
"\n",
" median_playtime_2weeks discount peak_ccu \\\n",
"0 350 0 1212356 \n",
"1 0 0 616738 \n",
"2 892 0 555977 \n",
"3 74 0 117698 \n",
"4 306 80 89916 \n",
"\n",
" tags pct_pos_total \\\n",
"0 {'FPS': 90857, 'Shooter': 65397, 'Multiplayer'... 86 \n",
"1 {'Survival': 14838, 'Shooter': 12727, 'Battle ... 59 \n",
"2 {'Free to Play': 59933, 'MOBA': 20158, 'Multip... 81 \n",
"3 {'Open World': 32644, 'Action': 23539, 'Multip... 87 \n",
"4 {'FPS': 9831, 'PvP': 9162, 'e-sports': 9072, '... 84 \n",
"\n",
" num_reviews_total pct_pos_recent num_reviews_recent \n",
"0 8632939 82 96473 \n",
"1 2513842 68 16720 \n",
"2 2452595 80 29366 \n",
"3 1803832 92 17517 \n",
"4 1168020 76 12608 \n",
"\n",
"[5 rows x 34 columns]\n"
]
}
],
"source": [
"# appid,name,release_date,required_age,price,dlc_count,detailed_description,about_the_game,short_description,reviews,header_image,website,support_url,support_email,windows,mac,linux,metacritic_score,metacritic_url,achievements,recommendations,notes,supported_languages,full_audio_languages,packages,developers,publishers,categories,genres,screenshots,movies,user_score,score_rank,positive,negative,estimated_owners,average_playtime_forever,average_playtime_2weeks,median_playtime_forever,median_playtime_2weeks,discount,peak_ccu,tags,pct_pos_total,num_reviews_total,pct_pos_recent,num_reviews_recent\n",
"dataset.drop(['appid', 'name', 'release_date', 'header_image', 'website', 'support_url', 'support_email', 'metacritic_url', 'developers', 'publishers', 'screenshots', 'movies', 'estimated_owners'], axis=1, inplace=True)\n",
"print(dataset.head())"
]
},
{
"cell_type": "markdown",
"id": "f5436c87",
"metadata": {},
"source": [
"### Structuring Text\n",
"**TODO: check whether this makes sense**\n",
"\n",
"The dataset holds a lot of unstructured data. We use Term Frequency-Inverse Document Frequency (TF-IDF) to turn most text features into structured numerical vectors.\n",
"It is important to use a new vectorizer instance for each feature so that each one learns its own vocabulary and the features don't overlap.\n",
"\n",
"### Standardize Values\n",
"We preprocess only the text features, by removing stop words. The dataset already provides standardized numerical features."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4e8b407c",
"metadata": {},
"outputs": [],
"source": [
"from sklearn.compose import make_column_transformer\n",
"from sklearn.feature_extraction.text import TfidfVectorizer\n",
"# appid,name,release_date,required_age,price,dlc_count,detailed_description,about_the_game,short_description,reviews,header_image,website,support_url,support_email,windows,mac,linux,metacritic_score,metacritic_url,achievements,recommendations,notes,supported_languages,full_audio_languages,packages,developers,publishers,categories,genres,screenshots,movies,user_score,score_rank,positive,negative,estimated_owners,average_playtime_forever,average_playtime_2weeks,median_playtime_forever,median_playtime_2weeks,discount,peak_ccu,tags,pct_pos_total,num_reviews_total,pct_pos_recent,num_reviews_recent\n",
"text_columns = ['detailed_description', 'about_the_game', 'short_description', 'reviews']\n",
"# TfidfVectorizer cannot handle NaN, so replace missing text with empty strings\n",
"dataset[text_columns] = dataset[text_columns].fillna('')\n",
"\n",
"# Pass each column name as a string (not a list): TfidfVectorizer expects a 1-D sequence of documents.\n",
"# With a list it receives a one-column DataFrame and vectorizes the column name instead of the texts.\n",
"column_transformer = make_column_transformer(\n",
" (TfidfVectorizer(stop_words='english'), 'detailed_description'),\n",
" (TfidfVectorizer(stop_words='english'), 'about_the_game'),\n",
" (TfidfVectorizer(stop_words='english'), 'short_description'),\n",
" (TfidfVectorizer(stop_words='english'), 'reviews'),\n",
")\n",
"\n",
"dataset2 = column_transformer.fit_transform(dataset)\n",
"print(dataset2)"
]
},
{
"cell_type": "markdown",
"id": "ad84e777",
"metadata": {},
"source": [
"\n",
"### Removing Bundles\n",
"**TODO: decide whether to do this; it is not as important as initially thought**\n",
"\n",
"Bundles (e.g. publisher bundles) don't have clearly defined genre(s), so they could be removed from the training set."
]
},
{
"cell_type": "markdown",
"id": "6a2a3d4f",
"metadata": {},
"source": [
"### Setting missing values\n",
"**TODO: Remove or fill in values that are not set or NaN in the dataset**"
]
},
{
"cell_type": "markdown",
"id": "091d7e13",
"metadata": {},
"source": [
"# Data Split\n",
"**TODO: Split the data into training, test, and validation sets**"
]
},
{
"cell_type": "markdown",
"id": "12b5283d",
"metadata": {},
"source": [
"# Model Selection\n",
"**TODO: Decide which model to use for this task**"
]
},
{
"cell_type": "markdown",
"id": "b7795aa1",
"metadata": {},
"source": [
"### Training\n",
"**TODO: Train the selected model with the training data**"
]
},
{
"cell_type": "markdown",
"id": "0faa9856",
"metadata": {},
"source": [
"# Evaluation\n",
"**TODO: Test the model with the test data**"
]
},
{
"cell_type": "markdown",
"id": "2aeb6fc2",
"metadata": {},
"source": [
"# Optimization\n",
"**TODO: Optimize the model based on the test results**"
]
},
{
"cell_type": "markdown",
"id": "79b20645",
"metadata": {},
"source": [
"# Validation\n",
"**TODO: Predict actual values**"
]
},
{
"cell_type": "markdown",
"id": "3b709fb7",
"metadata": {},
"source": [
"# Conclusion and outlook\n",
"**TODO: Write a conclusion and an outlook covering what can still be done and where the issues were**"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "base",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.13.5"
}
},
"nbformat": 4,
"nbformat_minor": 5
}