{
"cells": [
{
"cell_type": "markdown",
"id": "a3a7634f",
"metadata": {},
"source": [
"# Machine Learning project in SoSe 2025 at HTW Saar\n",
"## Idea\n",
"The goal of this project is to predict the genre(s) of a game or bundle from its description(s).\n",
"\n",
"## Dataset\n",
"For our project we use a Steam dataset provided on Moodle, since it contains all the information we plan to use.\n",
"The dataset has been cut down to only 2000 data points so that the notebook can run on weaker devices."
]
},
{
"cell_type": "code",
"id": "3116b75f",
"metadata": {
"jupyter": {
"is_executing": true
}
},
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"from sklearn import set_config\n",
"\n",
"set_config(transform_output=\"pandas\")\n",
"\n",
"dataset = pd.read_csv(\"./games_march2025_cleaned_2k.csv\",sep=\",\")\n",
"print(dataset.head())"
],
"outputs": [],
"execution_count": null
},
{
"cell_type": "markdown",
"id": "cba9750a",
"metadata": {},
"source": [
"## Preparation of the Dataset\n",
"### Removing Uniques\n",
"The following features can (or could) uniquely identify a data point, so they would normally be removed from the training set. We skip the explicit removal here because the next step drops these columns anyway.\n",
"- AppId\n",
"- Name of the Game\n",
"- Release Date\n",
"- Reviews\n",
"- Header Image\n",
"- Website\n",
"- Support URL\n",
"- Support Email\n",
"- MetaCritic URL\n",
"- Developer\n",
"- Publisher\n",
"- Screenshots\n",
"- Movies\n",
"- Estimated Owners"
]
},
{
"metadata": {},
"cell_type": "code",
"source": [
"#dataset.drop(['appid', 'name', 'release_date', 'reviews', 'header_image', 'website', 'support_url', 'support_email', 'metacritic_url', 'notes', 'developers', 'publishers', 'screenshots', 'movies', 'estimated_owners'], axis=1, inplace=True)\n",
"#print(dataset.head())"
],
"id": "d159117377f3633c",
"outputs": [],
"execution_count": null
},
{
"metadata": {},
"cell_type": "markdown",
"source": [
"## Hold onto necessary information\n",
"Our model should turn a textual description of a game into its genre(s). For that we need all the textual information a game has, as well as the genres of the game.\n",
"We use a ColumnTransformer to drop all unnecessary columns, merge all descriptions of a game into one big description, and keep the genres.\n",
"\n",
"It is important to use ``verbose_feature_names_out=False`` so that the output column names are not prefixed with the transformer names."
],
"id": "e1b28ddd69f1e9a6"
},
{
"metadata": {
"jupyter": {
"is_executing": true
}
},
"cell_type": "code",
"source": [
"from sklearn.compose import ColumnTransformer\n",
"from sklearn.preprocessing import FunctionTransformer\n",
"\n",
"# output columns: desc, genres\n",
"column_transformer = ColumnTransformer([\n",
"        # merge all descriptions into a single 'desc' column\n",
"        ('desc', FunctionTransformer(lambda X: X.fillna('').agg(' '.join, axis=1).to_frame(name=\"desc\")),\n",
"         ['detailed_description', 'about_the_game', 'short_description']),\n",
"        # keep the genres unchanged\n",
"        ('pass', 'passthrough', ['genres']),\n",
"    ],\n",
"    verbose_feature_names_out=False\n",
")\n",
"dataset = column_transformer.fit_transform(dataset)\n",
"print(dataset.head())"
],
"id": "986fbb31a7ae0d8b",
"outputs": [],
"execution_count": null
},
{
"metadata": {},
"cell_type": "markdown",
"source": [
"### Adding missing Information\n",
"Some games might not have any descriptions. For these we insert an empty string.\n",
"**TODO: check if dropna and fillna numeric_only are needed, as we don't have any numeric columns**"
],
"id": "f9b89c0645811564"
},
{
"metadata": {},
"cell_type": "code",
"source": [
"# missing numeric values => mean\n",
"dataset.fillna(dataset.mean(numeric_only=True), inplace=True)\n",
"# missing strings => empty string\n",
"dataset.fillna('', inplace=True)\n",
"# drop all rows that still contain missing values\n",
"dataset.dropna(inplace=True)"
],
"id": "44239f6b7fd23cde",
"outputs": [],
"execution_count": null
},
{
"metadata": {},
"cell_type": "markdown",
"source": [
"## Transform Genres\n",
"The genre information is currently a string holding a Python list of genres. While this is machine-readable, we need one-hot encoding for our model to work.\n",
"\n",
"#### Deserializing the String-Array\n",
"The \"ast\" library's ``literal_eval`` safely evaluates a string containing a Python literal (here a list of strings) without executing arbitrary code, so we use it to turn the genre strings into actual lists."
],
"id": "ca5b59b9fa8160a0"
},
{
"metadata": {},
"cell_type": "code",
"source": [
"import ast\n",
"\n",
"dataset['genres'] = dataset['genres'].map(ast.literal_eval)\n",
"print(dataset['genres'])"
],
"id": "ebc5a24e9bc87fdd",
"outputs": [],
"execution_count": null
},
{
"metadata": {},
"cell_type": "markdown",
"source": [
"#### One-Hot-Encoding a Python-Array\n",
"The sklearn ``OneHotEncoder()`` is only able to work with a 1D array of classes, such as ``['Politics', 'Sport', 'Culture']``, in which every data point has exactly one class.\n",
"Steam allows an app/bundle to have multiple genres. As such, our dataset has a 2D array (one list of classes per data point), which sklearn's ``MultiLabelBinarizer()`` does support."
],
"id": "f90756f9ad9211f4"
},
{
"metadata": {},
"cell_type": "code",
"source": [
"from sklearn.preprocessing import MultiLabelBinarizer\n",
"\n",
"mlb_genres = MultiLabelBinarizer()\n",
"genres_encoded = mlb_genres.fit_transform(dataset.pop('genres'))\n",
"genres_df = pd.DataFrame(genres_encoded, columns=mlb_genres.classes_)\n",
"print(genres_df.head())"
],
"id": "d2c3527a5fc876bf",
"outputs": [],
"execution_count": null
},
{
"metadata": {},
"cell_type": "markdown",
"source": "With this, our target matrix is completed.",
"id": "671c01f9f4ae66d9"
},
{
"cell_type": "markdown",
"id": "f5436c87",
"metadata": {},
"source": [
"### Structuring Text\n",
"If we want our model to be able to use text as input, we have to vectorize the text. TF-IDF (Term Frequency-Inverse Document Frequency) is an easy way of transforming each word into a feature with a value between 0 and 1. **TODO: filter out stopwords**"
]
},
{
"cell_type": "code",
"id": "4e8b407c",
"metadata": {},
"source": [
"from sklearn.feature_extraction.text import TfidfVectorizer\n",
"\n",
"vectorizer = TfidfVectorizer()\n",
"tfidf_matrix = vectorizer.fit_transform(dataset['desc'])  # sparse matrix, not a pandas DataFrame\n",
"tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), columns=vectorizer.get_feature_names_out())\n",
"print(tfidf_df.head())"
],
"outputs": [],
"execution_count": null
},
{
"cell_type": "markdown",
"id": "ad84e777",
"metadata": {},
"source": "With this, our feature matrix is completed."
},
{
"metadata": {},
"cell_type": "code",
"outputs": [],
"execution_count": null,
"source": [
"X = tfidf_df\n",
"y = genres_df"
],
"id": "86d9da42f4df8e49"
},
{
"metadata": {},
"cell_type": "markdown",
"source": [
"## The Model\n",
"\n",
"#### Removing unpredictable Datapoints\n",
"Some data points don't have a genre assigned (all values in their row of y are 0). The model we use can't handle such cases, so they have to be removed.\n",
"We build a boolean mask that selects all usable rows and apply it to both matrices."
],
"id": "aeb782668f311cd8"
},
{
"metadata": {},
"cell_type": "code",
"outputs": [],
"execution_count": null,
"source": [
"mask = y.sum(axis=1) > 0  # True for rows with at least one genre\n",
"print((~mask).sum())  # count of unpredictable datapoints\n",
"\n",
"X_clean = X[mask]\n",
"y_clean = y[mask]"
],
"id": "4919bf1b37d171a7"
},
{
"cell_type": "markdown",
"id": "091d7e13",
"metadata": {},
"source": [
"# Splitting up data\n",
"We have to split up our data into training and testing data.\n",
"Using ``random_state=0`` makes the split reproducible."
]
},
{
"cell_type": "code",
"id": "cfbf3787",
"metadata": {
"jupyter": {
"is_executing": true
}
},
"source": [
"from sklearn.model_selection import train_test_split\n",
"\n",
"X_train, X_test, y_train, y_test = train_test_split(X_clean, y_clean, random_state=0)"
],
"outputs": [],
"execution_count": null
},
{
"cell_type": "markdown",
"id": "12b5283d",
"metadata": {},
"source": [
"# Model Selection\n",
"**TODO Deciding which model to use for this task**\n",
"\n",
"As a game can have multiple genres, our model has to be capable of multi-label classification. sklearn's ``MultiOutputClassifier`` can do this by fitting one classifier per target column. As the base estimator for ``MultiOutputClassifier`` we use ``LogisticRegression``."
]
},
{
"metadata": {},
"cell_type": "code",
"outputs": [],
"execution_count": null,
"source": [
"from sklearn.linear_model import LogisticRegression\n",
"from sklearn.multioutput import MultiOutputClassifier\n",
"\n",
"# n_jobs=1 since there seems to be some multithreading join issue in sklearn (or my PC is too slow)\n",
"multi_target_clf = MultiOutputClassifier(LogisticRegression(max_iter=1337, random_state=0), n_jobs=1)\n",
"\n",
"multi_target_clf.fit(X_train, y_train)\n",
"\n",
"y_pred = multi_target_clf.predict(X_test)"
],
"id": "8c1d72c4532bd509"
},
{
"cell_type": "markdown",
"id": "0faa9856",
"metadata": {},
"source": [
"# Evaluation\n",
"**TODO Test the Model with the test data**"
]
},
{
"metadata": {},
"cell_type": "code",
"outputs": [],
"execution_count": null,
"source": "from sklearn.metrics import classification_report\n\nprint(classification_report(y_test, y_pred, zero_division=0.0))",
"id": "e2ebea6945193e07"
},
{
"cell_type": "markdown",
"id": "2aeb6fc2",
"metadata": {},
"source": [
"# Optimization\n",
"**TODO optimize the model based on the test results**"
]
},
{
"cell_type": "markdown",
"id": "79b20645",
"metadata": {},
"source": [
"# Validation\n",
"**TODO Predict actual values**"
]
},
{
"cell_type": "markdown",
"id": "3b709fb7",
"metadata": {},
"source": [
"# Conclusion and outlook\n",
"**TODO Write a conclusion and outlook what can be done and where the issues were.**"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.13.3"
}
},
"nbformat": 4,
"nbformat_minor": 5
}