{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "a3a7634f",
   "metadata": {},
   "source": [
    "# Machine Learning Project, Summer Semester (SoSe) 2025, HTW Saar\n",
    "## Idea\n",
    "The goal of this project is to predict the genre(s) of a game or bundle from its description(s).\n",
    "\n",
    "## Dataset\n",
    "We use a Steam dataset provided on Moodle, since it contains all the information we plan to use.\n",
    "The dataset has been cut down to 2000 data points so that the notebook runs on weaker devices."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "id": "3116b75f",
   "metadata": {
    "jupyter": {
     "is_executing": true
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      " appid name release_date required_age price dlc_count \\\n",
      "0 730 Counter-Strike 2 2012-08-21 0 0.0 1 \n",
      "\n",
      " detailed_description \\\n",
      "0 For over two decades, Counter-Strike has offer... \n",
      "\n",
      " about_the_game \\\n",
      "0 For over two decades, Counter-Strike has offer... \n",
      "\n",
      " short_description reviews ... \\\n",
      "0 For over two decades, Counter-Strike has offer... NaN ... \n",
      "\n",
      " average_playtime_2weeks median_playtime_forever median_playtime_2weeks \\\n",
      "0 879 5174 350 \n",
      "\n",
      " discount peak_ccu tags \\\n",
      "0 0 1212356 {'FPS': 90857, 'Shooter': 65397, 'Multiplayer'... \n",
      "\n",
      " pct_pos_total num_reviews_total pct_pos_recent num_reviews_recent \n",
      "0 86 8632939 82 96473 \n",
      "\n",
      "[1 rows x 47 columns]\n"
     ]
    }
   ],
   "source": [
    "import numpy as np\n",
    "import pandas as pd\n",
    "from sklearn import set_config\n",
    "\n",
    "set_config(transform_output=\"pandas\")\n",
    "\n",
    "dataset = pd.read_csv(\"./games_march2025_cleaned_2k.csv\", sep=\",\")\n",
    "print(dataset.head(1))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cba9750a",
   "metadata": {},
   "source": [
    "## Preparation of the Dataset\n",
    "### Removing Uniques\n",
    "The following features can uniquely identify a datapoint and would normally be removed from the training set. We skip this step because they are dropped in the next step anyway.\n",
    "- AppId\n",
    "- Name of the Game\n",
    "- Release Date\n",
    "- Reviews\n",
    "- Header Image\n",
    "- Website\n",
    "- Support URL\n",
    "- Support Email\n",
    "- MetaCritic URL\n",
    "- Developer\n",
    "- Publisher\n",
    "- Screenshots\n",
    "- Movies\n",
    "- Estimated Owners"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "id": "d159117377f3633c",
   "metadata": {},
   "outputs": [],
   "source": [
    "#dataset.drop(['appid', 'name', 'release_date', 'reviews', 'header_image', 'website', 'support_url', 'support_email', 'metacritic_url', 'notes', 'developers', 'publishers', 'screenshots', 'movies', 'estimated_owners'], axis=1, inplace=True)\n",
    "#print(dataset.head())"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e1b28ddd69f1e9a6",
   "metadata": {},
   "source": [
    "## Hold onto necessary information\n",
    "Our model should turn a textual description of a game into its genres. For that we need all the textual information a game has, as well as the genres of the game.\n",
    "We use a ColumnTransformer to drop all unnecessary columns, merge all descriptions of a game into one big description and hold onto the genres.\n",
    "\n",
    "It is important to use ``verbose_feature_names_out=False`` so that the output column names are not prefixed with the transformer names."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "id": "986fbb31a7ae0d8b",
   "metadata": {
    "jupyter": {
     "is_executing": true
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      " desc \\\n",
      "0 For over two decades, Counter-Strike has offer... \n",
      "1 LAND, LOOT, SURVIVE! Play PUBG: BATTLEGROUNDS ... \n",
      "2 The most-played game on Steam. Every day, mill... \n",
      "3 When a young street hustler, a retired bank ro... \n",
      "4 Edition Comparison Ultimate Edition The Tom Cl... \n",
      "\n",
      " genres \n",
      "0 ['Action', 'Free To Play'] \n",
      "1 ['Action', 'Adventure', 'Massively Multiplayer... \n",
      "2 ['Action', 'Strategy', 'Free To Play'] \n",
      "3 ['Action', 'Adventure'] \n",
      "4 ['Action'] \n"
     ]
    }
   ],
   "source": [
    "from sklearn.compose import ColumnTransformer\n",
    "from sklearn.preprocessing import FunctionTransformer\n",
    "\n",
    "# desc, genres\n",
    "column_transformer = ColumnTransformer([\n",
    "        # merge all descriptions\n",
    "        ('desc', FunctionTransformer(lambda X: X.fillna('').agg(' '.join, axis=1).to_frame(name=\"desc\")),\n",
    "         ['detailed_description', 'about_the_game', 'short_description']),\n",
    "        ('pass', 'passthrough', ['genres']),\n",
    "    ],\n",
    "    verbose_feature_names_out=False\n",
    ")\n",
    "dataset = column_transformer.fit_transform(dataset)\n",
    "print(dataset.head())"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f9b89c0645811564",
   "metadata": {},
   "source": [
    "### Adding missing Information\n",
    "Some games might not have any description. For these we insert an empty string.\n",
    "**TODO: check whether dropna and fillna with numeric_only are needed, as we don't have any numeric columns**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "id": "44239f6b7fd23cde",
   "metadata": {},
   "outputs": [],
   "source": [
    "# missing numeric values => mean\n",
    "dataset.fillna(dataset.mean(numeric_only=True), inplace=True)\n",
    "# missing strings => empty string\n",
    "dataset.fillna('', inplace=True)\n",
    "# drop all rows that still have missing values\n",
    "dataset.dropna(inplace=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ca5b59b9fa8160a0",
   "metadata": {},
   "source": [
    "## Transform Genres\n",
    "The genre information is currently a string containing a Python list of genres. While this is machine-readable, we need a binary (one-hot style) encoding for our model to work.\n",
    "\n",
    "#### Parsing the String-Array\n",
    "The ``literal_eval`` function of the \"ast\" library safely evaluates a string that contains a Python literal, and as such will be used for parsing the genres into real lists."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "id": "ebc5a24e9bc87fdd",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0 [Action, Free To Play]\n",
      "1 [Action, Adventure, Massively Multiplayer, Fre...\n",
      "2 [Action, Strategy, Free To Play]\n",
      "3 [Action, Adventure]\n",
      "4 [Action]\n",
      "Name: genres, dtype: object\n"
     ]
    }
   ],
   "source": [
    "import ast\n",
    "\n",
    "# parse each genre string into a real Python list\n",
    "dataset['genres'] = dataset['genres'].map(ast.literal_eval)\n",
    "print(dataset['genres'].head())"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f90756f9ad9211f4",
   "metadata": {},
   "source": [
    "#### One-Hot-Encoding a Python List\n",
    "The sklearn ``OneHotEncoder()`` expects every datapoint to have exactly one class per feature, such as one of ``['Politics', 'Sport', 'Culture']``. A datapoint cannot belong to several classes at once.\n",
    "Steam allows an app/bundle to have multiple genres. As such, every datapoint in our dataset has a list of classes, which sklearn's ``MultiLabelBinarizer()`` does support."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "id": "d2c3527a5fc876bf",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      " Action Adventure Casual Early Access Free To Play Gore Indie \\\n",
      "0 1 0 0 0 1 0 0 \n",
      "1 1 1 0 0 1 0 0 \n",
      "2 1 0 0 0 1 0 0 \n",
      "3 1 1 0 0 0 0 0 \n",
      "4 1 0 0 0 0 0 0 \n",
      "\n",
      " Massively Multiplayer RPG Racing Simulation Sports Strategy Violent \n",
      "0 0 0 0 0 0 0 0 \n",
      "1 1 0 0 0 0 0 0 \n",
      "2 0 0 0 0 0 1 0 \n",
      "3 0 0 0 0 0 0 0 \n",
      "4 0 0 0 0 0 0 0 \n"
     ]
    }
   ],
   "source": [
    "from sklearn.preprocessing import MultiLabelBinarizer\n",
    "\n",
    "mlb_genres = MultiLabelBinarizer()\n",
    "genres_encoded = mlb_genres.fit_transform(dataset.pop('genres'))\n",
    "genres_df = pd.DataFrame(genres_encoded, columns=mlb_genres.classes_)\n",
    "print(genres_df.head())"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "671c01f9f4ae66d9",
   "metadata": {},
   "source": [
    "With this, our target matrix is complete."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f5436c87",
   "metadata": {},
   "source": [
    "### Vectorizing Text\n",
    "If we want our model to be able to use text as an input, we have to vectorize the text. TF-IDF (term frequency-inverse document frequency) is an easy way of transforming each word into a feature with a value between 0 and 1. **TODO: filter out stopwords**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "id": "4e8b407c",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      " 00 000 000km 000th 00am 00f 00i 00p 00v 01 ... 이터널 이터널리턴 \\\n",
      "0 0.0 0.0 0.0 0.00000 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 \n",
      "1 0.0 0.0 0.0 0.00000 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 \n",
      "2 0.0 0.0 0.0 0.14649 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 \n",
      "3 0.0 0.0 0.0 0.00000 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 \n",
      "4 0.0 0.0 0.0 0.00000 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 \n",
      "\n",
      " 이현준 정대찬 중입니다 철권 토탈워 페르소나 한국어 한글을 \n",
      "0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n",
      "1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n",
      "2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n",
      "3 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n",
      "4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n",
      "\n",
      "[5 rows x 29351 columns]\n"
     ]
    }
   ],
   "source": [
    "from sklearn.feature_extraction.text import TfidfVectorizer\n",
    "\n",
    "vectorizer = TfidfVectorizer()\n",
    "tfidf_matrix = vectorizer.fit_transform(dataset['desc']) # matrix, not pandas df\n",
    "tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), columns=vectorizer.get_feature_names_out())\n",
    "print(tfidf_df.head())"
   ]
  },
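  {
   "cell_type": "markdown",
   "id": "added-stopwords-sketch",
   "metadata": {},
   "source": [
    "A minimal sketch of how the stop-word TODO could be approached, assuming the descriptions are mostly English. It is not part of the pipeline above: `docs` is a made-up toy corpus, and the built-in `stop_words='english'` list of `TfidfVectorizer` only covers English, so words from other languages in the dataset would stay in the vocabulary.\n",
    "\n",
    "```python\n",
    "from sklearn.feature_extraction.text import TfidfVectorizer\n",
    "\n",
    "# made-up toy descriptions, not taken from the dataset\n",
    "docs = ['The hero explores the dungeon', 'A racing game with the fastest cars']\n",
    "\n",
    "plain = TfidfVectorizer().fit(docs)\n",
    "filtered = TfidfVectorizer(stop_words='english').fit(docs)\n",
    "\n",
    "# vocabulary entries that disappear once English stop words are filtered\n",
    "removed = sorted(set(plain.get_feature_names_out()) - set(filtered.get_feature_names_out()))\n",
    "print(removed)\n",
    "```\n",
    "\n",
    "Whether this helps the genre prediction would have to be checked against the F1-scores; it does at least shrink the feature matrix."
   ]
  },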
  {
   "cell_type": "markdown",
   "id": "ad84e777",
   "metadata": {},
   "source": [
    "With this, our feature matrix is complete."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "id": "86d9da42f4df8e49",
   "metadata": {},
   "outputs": [],
   "source": [
    "X = tfidf_df\n",
    "y = genres_df"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "aeb782668f311cd8",
   "metadata": {},
   "source": [
    "## The Model\n",
    "\n",
    "#### Removing unpredictable Datapoints\n",
    "\n",
    "Some genres have too few datapoints to be predictable. The 10k dataset has 12 classes with fewer than 5 datapoints, usually only 1 or 2. It is too likely that all of their datapoints end up in only the training data or only the test data, so these genres are removed."
   ]
  },
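  {
   "cell_type": "markdown",
   "id": "added-split-risk-sketch",
   "metadata": {},
   "source": [
    "To get a feeling for that risk, the following sketch computes the probability that a genre with `k` datapoints has no datapoint at all in the test set of a random split. The function name and the numbers (2000 datapoints, 500 of them in the test set, roughly the default 25% of `train_test_split`) are assumptions for illustration, and the split is treated as a uniformly random draw without replacement.\n",
    "\n",
    "```python\n",
    "from math import comb\n",
    "\n",
    "def prob_absent_from_test(n_total, n_test, k):\n",
    "    # chance that none of the k datapoints of a genre is drawn into the test set\n",
    "    return comb(n_total - k, n_test) / comb(n_total, n_test)\n",
    "\n",
    "# hypothetical sizes: 2000 datapoints in total, 500 in the test set\n",
    "for k in (1, 2, 5, 10):\n",
    "    print(k, round(prob_absent_from_test(2000, 500, k), 3))\n",
    "```\n",
    "\n",
    "With these numbers a genre with a single datapoint is missing from the test set with probability 0.75, and the value shrinks roughly like 0.75 to the power of k as k grows."
   ]
  },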
  {
   "cell_type": "code",
   "execution_count": 27,
   "id": "e1bc73d4",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Before(1999, 14)\n",
      "After(1999, 12)\n"
     ]
    }
   ],
   "source": [
    "# remove genres that have fewer than 5 entries -> the risk of a broken split is too big\n",
    "mask = (y == 1).sum() >= 5\n",
    "print(\"Before\" + str(y.shape))\n",
    "y_prep = y.loc[:, mask]\n",
    "print(\"After\" + str(y_prep.shape))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2fa60e6b",
   "metadata": {},
   "source": [
    "Some datapoints don't have a genre assigned (all target values in y are 0, either from the start or after we removed genres in the previous step). The model we use can't handle such cases, thus they have to be removed.\n",
    "We build a mask of all datapoints that we can use, and apply that mask to both matrices."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "id": "4919bf1b37d171a7",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "13\n"
     ]
    }
   ],
   "source": [
    "# NOTE: the mask is built from y (before the rare genres were dropped);\n",
    "# use y_prep instead to also catch rows that are left without any genre\n",
    "mask = y.sum(axis=1) > 0\n",
    "print((~mask).sum()) # count of unpredictable datapoints\n",
    "\n",
    "X_clean = X[mask]\n",
    "y_clean = y_prep[mask]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "091d7e13",
   "metadata": {},
   "source": [
    "# Splitting up data\n",
    "We have to split up our data into training and testing data.\n",
    "Using random_state=0 makes the split reproducible."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "id": "cfbf3787",
   "metadata": {
    "jupyter": {
     "is_executing": true
    }
   },
   "outputs": [],
   "source": [
    "from sklearn.model_selection import train_test_split\n",
    "\n",
    "X_train, X_test, y_train, y_test = train_test_split(X_clean, y_clean, random_state=0)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8cd4bb54",
   "metadata": {},
   "source": [
    "Before proceeding, we also delete all intermediate objects that are no longer needed to free memory."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "id": "0b0a46a4",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "1905"
      ]
     },
     "execution_count": 30,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import gc\n",
    "\n",
    "# Initial dataset loading\n",
    "del dataset\n",
    "del column_transformer\n",
    "\n",
    "# preparation of y\n",
    "del mlb_genres\n",
    "del genres_encoded\n",
    "del genres_df\n",
    "\n",
    "# preparation of X\n",
    "del tfidf_df\n",
    "del vectorizer\n",
    "del tfidf_matrix\n",
    "\n",
    "# Initial Dataset\n",
    "del X\n",
    "del y\n",
    "# Removing Genres with less than 5 datapoints\n",
    "del y_prep\n",
    "\n",
    "# Sorting out dead datapoints (all target values are 0)\n",
    "del X_clean\n",
    "del y_clean\n",
    "del mask\n",
    "\n",
    "gc.collect()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "84f56229",
   "metadata": {},
   "source": [
    "Now that all data is prepared, we need to choose a classification model that meets our standards."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "917ba82f",
   "metadata": {},
   "source": [
    "# Excursion: Choosing a Classification Model\n",
    "``sklearn`` has many different classification models to choose from, but we only have limited time and computing power.\n",
    "As such, we tested many different models on the 2k dataset and chose the 5 best-performing ones for the big dataset.\n",
    "\n",
    "### Initial Comparison\n",
    "The comparison script is not included in this notebook, but you can find it in the ``compare_models_2k.py`` file and try it out yourself.\n",
    "The comparison followed a few baseline rules:\n",
    "- All hyperparameters are set to their defaults\n",
    "- All iteration limits are set to 3000 (exception: MLPClassifier with 300, where the limit counts epochs instead of iterations)\n",
    "- All ``random_state``s are set to 0\n",
    "\n",
    "Running all models with that configuration yields the following weighted F1-Scores (results as seen in the ``games_march2025_cleaned_2k_i3k`` folder):\n",
    "\n",
    "\n",
    "\n",
    "If we also compare Micro/Macro values, we see that all models have a much lower Macro-F1 than Micro/Weighted-F1. That is because the 2k dataset does not contain enough datapoints for every class (2 classes have no test datapoints at all), so we should proceed to the 10k dataset before making major choices.\n",
    "\n",
    "\n",
    "\n",
    "The 10 best-performing models, which will run on the 10k dataset with the same rules as before:\n",
    "1. NearestCentroid\n",
    "2. Perceptron\n",
    "3. PassiveAggressiveClassifier\n",
    "4. LinearSVC\n",
    "5. SGDClassifier\n",
    "6. HistGradientBoostingClassifier\n",
    "7. MLPClassifier\n",
    "8. RidgeClassifier\n",
    "9. GradientBoostingClassifier\n",
    "10. LinearDiscriminantAnalysis\n",
    "\n",
    "\n",
    "\n",
    "We can also compare these models between datasets, to see whether a bigger dataset always improves the performance.\n",
    "\n",
    "\n",
    "\n",
    "The final contenders are:\n",
    "1.\n",
    "2.\n",
    "3.\n",
    "4.\n",
    "5.\n",
    "\n",
    "..."
   ]
  },
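  {
   "cell_type": "markdown",
   "id": "added-f1-averaging-sketch",
   "metadata": {},
   "source": [
    "The gap between Macro-F1 and Micro/Weighted-F1 can be reproduced on a tiny made-up example. The arrays below are hypothetical and only illustrate how `f1_score` averages: the rare class is never predicted, which drags down the macro average while the micro and weighted averages stay high.\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "from sklearn.metrics import f1_score\n",
    "\n",
    "# made-up multilabel targets: class 0 is frequent, class 1 is rare\n",
    "y_true = np.array([[1, 0], [1, 0], [1, 0], [1, 1]])\n",
    "# hypothetical predictions that never predict the rare class\n",
    "y_pred = np.array([[1, 0], [1, 0], [1, 0], [1, 0]])\n",
    "\n",
    "scores = {\n",
    "    avg: f1_score(y_true, y_pred, average=avg, zero_division=0)\n",
    "    for avg in ('micro', 'macro', 'weighted')\n",
    "}\n",
    "for avg, value in scores.items():\n",
    "    print(avg, round(value, 3))\n",
    "```\n",
    "\n",
    "Here the frequent class has an F1 of 1 and the rare class an F1 of 0, so the macro average is 0.5, while the weighted average is 0.8 and the micro average is about 0.89."
   ]
  },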
  {
   "cell_type": "markdown",
   "id": "12b5283d",
   "metadata": {},
   "source": [
    "## Model Selection\n",
    "**TODO Deciding which model to use for this task**\n",
    "\n",
    "As a game can have multiple genres, our model has to be capable of multi-label classification. sklearn's ``MultiOutputClassifier`` can do this by fitting one classifier per genre. As the base estimator for ``MultiOutputClassifier`` we use ``LogisticRegression``."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "id": "8c1d72c4532bd509",
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.linear_model import LogisticRegression\n",
    "from sklearn.multioutput import MultiOutputClassifier\n",
    "\n",
    "# n_jobs=1 since there seems to be a multithreading join issue in sklearn (or my PC is too weak)\n",
    "multi_target_clf = MultiOutputClassifier(LogisticRegression(max_iter=1337, random_state=0), n_jobs=1)\n",
    "\n",
    "multi_target_clf.fit(X_train, y_train)\n",
    "\n",
    "y_pred = multi_target_clf.predict(X_test)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0faa9856",
   "metadata": {},
   "source": [
    "# Evaluation\n",
    "**TODO Test the Model with the test data**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "id": "e2ebea6945193e07",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      " precision recall f1-score support\n",
      "\n",
      " 0 0.78 0.91 0.84 300\n",
      " 1 0.78 0.62 0.69 216\n",
      " 2 1.00 0.03 0.07 86\n",
      " 3 0.00 0.00 0.00 46\n",
      " 4 1.00 0.04 0.07 83\n",
      " 5 0.79 0.81 0.80 245\n",
      " 6 0.00 0.00 0.00 42\n",
      " 7 0.90 0.34 0.49 127\n",
      " 8 0.00 0.00 0.00 12\n",
      " 9 0.89 0.25 0.39 127\n",
      " 10 0.00 0.00 0.00 14\n",
      " 11 0.88 0.14 0.24 106\n",
      "\n",
      " micro avg 0.79 0.50 0.61 1404\n",
      " macro avg 0.58 0.26 0.30 1404\n",
      "weighted avg 0.77 0.50 0.53 1404\n",
      " samples avg 0.77 0.56 0.60 1404\n",
      "\n"
     ]
    }
   ],
   "source": [
    "from sklearn.metrics import classification_report\n",
    "\n",
    "print(classification_report(y_test, y_pred, zero_division=0.0))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2aeb6fc2",
   "metadata": {},
   "source": [
    "# Optimization\n",
    "**TODO optimize the model based on the test results**"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "79b20645",
   "metadata": {},
   "source": [
    "# Validation\n",
    "**TODO Predict actual values**"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3b709fb7",
   "metadata": {},
   "source": [
    "# Conclusion and outlook\n",
    "**TODO Write a conclusion and outlook what can be done and where the issues were.**"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.13.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}