Rename and add html and pdf

This commit is contained in:
FlorianSpeicher
2025-08-25 22:10:52 +02:00
parent fba98410f6
commit ad53cc55cb
3 changed files with 8346 additions and 19 deletions

734
Machine-Learning.ipynb Normal file

@@ -0,0 +1,734 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "a3a7634f",
"metadata": {},
"source": [
"# Machine Learning project in SoSe 2025 at HTW Saar\n",
"## Idea\n",
"The goal of this project is predicting the genre(s) of a game/bundle through its given description(s)\n",
"\n",
"## Dataset\n",
"For our project we use a Steam Dataset provided on moodle, since it has all information we plan on using.\n",
"The Dataset has been cut to only 2000 data points to be runnable on weaker devices."
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "3116b75f",
"metadata": {
"jupyter": {
"is_executing": true
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" appid name release_date required_age price dlc_count \\\n",
"0 730 Counter-Strike 2 2012-08-21 0 0.0 1 \n",
"\n",
" detailed_description \\\n",
"0 For over two decades, Counter-Strike has offer... \n",
"\n",
" about_the_game \\\n",
"0 For over two decades, Counter-Strike has offer... \n",
"\n",
" short_description reviews ... \\\n",
"0 For over two decades, Counter-Strike has offer... NaN ... \n",
"\n",
" average_playtime_2weeks median_playtime_forever median_playtime_2weeks \\\n",
"0 879 5174 350 \n",
"\n",
" discount peak_ccu tags \\\n",
"0 0 1212356 {'FPS': 90857, 'Shooter': 65397, 'Multiplayer'... \n",
"\n",
" pct_pos_total num_reviews_total pct_pos_recent num_reviews_recent \n",
"0 86 8632939 82 96473 \n",
"\n",
"[1 rows x 47 columns]\n"
]
}
],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"from sklearn import set_config\n",
"\n",
"set_config(transform_output=\"pandas\")\n",
"\n",
"dataset = pd.read_csv(\"./games_march2025_cleaned_2k.csv\",sep=\",\")\n",
"print(dataset.head(1))"
]
},
{
"cell_type": "markdown",
"id": "cba9750a",
"metadata": {},
"source": [
"## Preparation of the Dataset\n",
"### Removing Uniques\n",
"We would remove the following features from the Training-Set as they can/could uniquely identify a datapoint, but we don't as they will be removed in the next step anyway\n",
"- AppId\n",
"- Name of the Game\n",
"- Realease Date\n",
"- Reviews\n",
"- Header Image\n",
"- Website\n",
"- Support URL\n",
"- Support Email\n",
"- MetaCritic URL\n",
"- Developer\n",
"- Publisher\n",
"- Screenshots\n",
"- Movies\n",
"- Estimated Owners"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "d159117377f3633c",
"metadata": {},
"outputs": [],
"source": [
"#dataset.drop(['appid', 'name', 'release_date', 'reviews', 'header_image', 'website', 'support_url', 'support_email', 'metacritic_url', 'notes', 'developers', 'publishers', 'screenshots', 'movies', 'estimated_owners'], axis=1, inplace=True)\n",
"#print(dataset.head())"
]
},
{
"cell_type": "markdown",
"id": "e1b28ddd69f1e9a6",
"metadata": {},
"source": [
"## Hold onto necessary information\n",
"Our model should turn a textual description of a game into its genre. For that we need all the textual information a game has, as well as the genres of the game.\n",
"We use a ColumnTransformer to drop all unnecessary lines, merge all descriptions of a game into one big description and hold onto the genres\n",
"\n",
"It is important to use ``verbose_feature_names_out=False`` so the feature names don't get changed"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "986fbb31a7ae0d8b",
"metadata": {
"jupyter": {
"is_executing": true
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" desc \\\n",
"0 For over two decades, Counter-Strike has offer... \n",
"1 LAND, LOOT, SURVIVE! Play PUBG: BATTLEGROUNDS ... \n",
"2 The most-played game on Steam. Every day, mill... \n",
"3 When a young street hustler, a retired bank ro... \n",
"4 Edition Comparison Ultimate Edition The Tom Cl... \n",
"\n",
" genres \n",
"0 ['Action', 'Free To Play'] \n",
"1 ['Action', 'Adventure', 'Massively Multiplayer... \n",
"2 ['Action', 'Strategy', 'Free To Play'] \n",
"3 ['Action', 'Adventure'] \n",
"4 ['Action'] \n"
]
}
],
"source": [
"from sklearn.compose import ColumnTransformer\n",
"from sklearn.preprocessing import FunctionTransformer\n",
"\n",
"# desc, genres\n",
"column_transformer = ColumnTransformer([\n",
" # merge all descriptions\n",
" ('desc', FunctionTransformer(lambda X: X.fillna('').agg(' '.join, axis=1).to_frame(name=\"desc\")),\n",
" ['detailed_description', 'about_the_game', 'short_description']),\n",
" ('pass', 'passthrough', ['genres']),\n",
" ],\n",
" verbose_feature_names_out=False\n",
")\n",
"dataset = column_transformer.fit_transform(dataset)\n",
"print(dataset.head())"
]
},
{
"cell_type": "markdown",
"id": "f9b89c0645811564",
"metadata": {},
"source": [
"### Adding missing Information\n",
"Some Games might not have any descriptions. For these we Input an Empty String."
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "44239f6b7fd23cde",
"metadata": {},
"outputs": [],
"source": [
"# missing numeric values => mean\n",
"dataset.fillna(dataset.mean(numeric_only=True), inplace=True)\n",
"# missing strings => empty string?\n",
"dataset.fillna('', inplace=True)\n",
"# drop all lines with missing values\n",
"dataset.dropna(inplace=True)"
]
},
{
"cell_type": "markdown",
"id": "ca5b59b9fa8160a0",
"metadata": {},
"source": [
"## Transform Genres\n",
"The genre information currently is a string holding a python array of genres. While this is machine-readable, we need One-Hot-Encoding for our model to work.\n",
"\n",
"#### Serializing the String-Array\n",
"The \"ast\" library can interpret python strings as python code, and as such will be used for serializing the genres."
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "ebc5a24e9bc87fdd",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0 [Action, Free To Play]\n",
"1 [Action, Adventure, Massively Multiplayer, Fre...\n",
"2 [Action, Strategy, Free To Play]\n",
"3 [Action, Adventure]\n",
"4 [Action]\n",
"Name: genres, dtype: object\n"
]
}
],
"source": [
"import ast\n",
"\n",
"dataset['genres'] = dataset['genres'].map(lambda s: ast.literal_eval(s))\n",
"print(dataset['genres'].head())"
]
},
{
"cell_type": "markdown",
"id": "f90756f9ad9211f4",
"metadata": {},
"source": [
"#### One-Hot-Encoding an Python-Array\n",
"The sklearn ``OneHotEncoder()`` is only able to work with an 1D Array of different classes, such as ``['Politics', 'Sport', 'Culture']``. Every datapoint can only have one concurrent classification.\n",
"Steam allows an app/bundle to have multiple genres. As such, our dataset has an 2D Array of different classes, which sklearn's ``MultiLabelBinarizer()`` does support."
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "d2c3527a5fc876bf",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" Action Adventure Casual Early Access Free To Play Gore Indie \\\n",
"0 1 0 0 0 1 0 0 \n",
"1 1 1 0 0 1 0 0 \n",
"2 1 0 0 0 1 0 0 \n",
"3 1 1 0 0 0 0 0 \n",
"4 1 0 0 0 0 0 0 \n",
"\n",
" Massively Multiplayer RPG Racing Simulation Sports Strategy Violent \n",
"0 0 0 0 0 0 0 0 \n",
"1 1 0 0 0 0 0 0 \n",
"2 0 0 0 0 0 1 0 \n",
"3 0 0 0 0 0 0 0 \n",
"4 0 0 0 0 0 0 0 \n"
]
}
],
"source": [
"from sklearn.preprocessing import MultiLabelBinarizer\n",
"\n",
"mlb_genres = MultiLabelBinarizer()\n",
"genres_encoded = mlb_genres.fit_transform(dataset.pop('genres'))\n",
"genres_df = pd.DataFrame(genres_encoded, columns=mlb_genres.classes_)\n",
"print(genres_df.head())"
]
},
{
"cell_type": "markdown",
"id": "671c01f9f4ae66d9",
"metadata": {},
"source": [
"With this, our target matrix is completed."
]
},
{
"cell_type": "markdown",
"id": "f5436c87",
"metadata": {},
"source": [
"### Structurizing Text\n",
"If we want our Model to be able to use text as an input, we have to vectorize the text. TF-IDF (Inverse Document Frequency) is an easy way of transforming each word into a feature with a 0 to 1 value."
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "4e8b407c",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" 00 000 000km 000th 00am 00f 00i 00p 00v 01 ... 이터널 이터널리턴 \\\n",
"0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 \n",
"1 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 \n",
"2 0.0 0.0 0.0 0.162349 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 \n",
"3 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 \n",
"4 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 \n",
"\n",
" 이현준 정대찬 중입니다 철권 토탈워 페르소나 한국어 한글을 \n",
"0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n",
"1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n",
"2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n",
"3 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n",
"4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n",
"\n",
"[5 rows x 29056 columns]\n"
]
}
],
"source": [
"from sklearn.feature_extraction.text import TfidfVectorizer\n",
"\n",
"vectorizer = TfidfVectorizer(stop_words='english')\n",
"tfidf_matrix = vectorizer.fit_transform(dataset['desc']) # matrix, not pandas df\n",
"tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), columns=vectorizer.get_feature_names_out())\n",
"print(tfidf_df.head())"
]
},
{
"cell_type": "markdown",
"id": "ad84e777",
"metadata": {},
"source": [
"With this our feature matrix is completed"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "86d9da42f4df8e49",
"metadata": {},
"outputs": [],
"source": [
"X = tfidf_df\n",
"y = genres_df"
]
},
{
"cell_type": "markdown",
"id": "aeb782668f311cd8",
"metadata": {},
"source": [
"## The Model\n",
"\n",
"#### Removing unpredicatble Datapoints\n",
"\n",
"Some genres have too little datapoints to be predictable. The 10k Dataset has 14 Classes that have less than 10 Datapoints, usually only 1 to 4. These have too big of a probability that they will fall into only the train or test data and therefore will be removed."
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "e1bc73d4",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Before(1999, 14)\n",
"After(1999, 12)\n"
]
}
],
"source": [
"# remove genres that have less than min_entries entries -> probability of broken split to big\n",
"mask = (y == 1).sum() >= 10\n",
"print(\"Before\" + str(y.shape))\n",
"y_prep = y.loc[:, mask]\n",
"print(\"After\" + str(y_prep.shape))"
]
},
{
"cell_type": "markdown",
"id": "2fa60e6b",
"metadata": {},
"source": [
"Some Datapoints don't have a genre assigned (all feature values in y are 0, either from the start or after we removed them one step before). The model we use can't handle such cases, thus they have to be removed.\n",
"We filter after all values that we can use with a mask, and apply that mask to our matrices."
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "4919bf1b37d171a7",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"13\n"
]
}
],
"source": [
"mask = y.sum(axis=1).map(lambda x: x > 0)\n",
"print((mask == False).sum()) # count of unpredictable datapoints\n",
"\n",
"X_clean = X[mask]\n",
"y_clean = y_prep[mask]"
]
},
{
"cell_type": "markdown",
"id": "091d7e13",
"metadata": {},
"source": [
"# Splitting up data\n",
"We have to split up our data into training and testing data.\n",
"Using random_state=0 guarantees reproducability."
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "cfbf3787",
"metadata": {
"jupyter": {
"is_executing": true
}
},
"outputs": [],
"source": [
"from sklearn.model_selection import train_test_split\n",
"\n",
"X_train, X_test, y_train, y_test = train_test_split(X_clean, y_clean, random_state=0)"
]
},
{
"cell_type": "markdown",
"id": "8cd4bb54",
"metadata": {},
"source": [
"We also do a little cleanup session before proceeding."
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "0b0a46a4",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"82"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import gc\n",
"\n",
"# Initial dataset loading\n",
"del dataset\n",
"del column_transformer\n",
"\n",
"# preparation of y\n",
"del mlb_genres\n",
"del genres_encoded\n",
"del genres_df\n",
"\n",
"# preparation of X\n",
"del tfidf_df\n",
"del tfidf_matrix\n",
"\n",
"# Initial Dataset\n",
"del X\n",
"del y\n",
"# Removing Genres with less than 5 datapoints\n",
"del y_prep\n",
"\n",
"# Sorting out dead datapoints (all target values are 0)\n",
"del X_clean\n",
"del y_clean\n",
"del mask\n",
"\n",
"gc.collect()"
]
},
{
"cell_type": "markdown",
"id": "84f56229",
"metadata": {},
"source": [
"Now that all data is prepared, we need to choose a Classification Model that meets our stanadrds."
]
},
{
"cell_type": "markdown",
"id": "917ba82f",
"metadata": {},
"source": [
"# Excursion: Choosing a classification Model\n",
"``sklearn`` has many different classification Models to choose from, but we only have limited time and computing power.\n",
"As such, we tested many different models on the small dataset and chose the best performing ones for the big dataset.\n",
"\n",
"### Initial Comparison\n",
"We won't put the comparison script in this notebook, but you can find it in the ``compare_models_2k.py`` file and try it out yourself.\n",
"There were some rules as a baseline for comparison:\n",
"- All Hyperparameters are set to default\n",
"- All iteration limits are set to 3000 (exception: MLPClassifier with 300, where i-limit are epochs instead of iterations)\n",
"- All ``random_state``s are set to 0\n",
"\n",
"Running all models with that configuration yields the following weighted F1-Scores (results as seen in the ``games_march2025_cleaned_2k_i3k`` folder): \n",
"\n",
"![Comparison Image 2k](./compare_models_2k.png)\n",
"\n",
"If we also compare Micro/Macro values, we see that all models have a much lower Macro-F1 than Micro/Weighted-F1. That is because the Dataset does not contain enough datapoints for every class (test data for 2 classes is 0 in the 2k dataset), so we should proceed to the 10k Dataset.\n",
"\n",
"![Comparison Image 2k Micro/Macro/Weighted](./compare_models_2k_3.png)\n",
"\n",
"The 10 best performing models which will run on the 10k Dataset with the same rules as before:\n",
"1. PassiveAggressiveClassifier \n",
"2. Perceptron\n",
"3. LinearSVC\n",
"4. SDGClassifer\n",
"5. HistGradientBoostingClassifier\n",
"6. NearestCentroid\n",
"7. MLPClassifier\n",
"8. GradientBoostingClassifier \n",
"9. RidgeClassifier\n",
"10. AdaBoostClassifier (because of an evaluation mistake, we used LinearDiscriminantAnalysis instead)\n",
"\n",
"That gave us the following results:\n",
"\n",
"![Comparison Image 10k](./compare_models_10k.png)\n",
"![Comparison Image 10k](./compare_models_10k_3.png)\n",
"\n",
"The top 5 are the same, with the only exception of Perceptron falling behind against the RidgeClassifier.\n",
"When comparing these models between datasets, it is evident that a bigger dataset yields better performance (for exponentially higher compute and time cost). Only NearestCentroid lost performance when comparing the Datasets.\n",
"\n",
"![Comparison Image between 2k and 10k](./compare_datasets_2k.png)\n",
"![Comparison Image between 2k and 10k, only 10k Models](./compare_datasets_10k.png)\n",
"\n",
"The final contenders are LinearSVC and PassiveAggressiveClassifier, which we would compare against each other using k-fold cross validation with different hyperparameters, but since training the model on the dataset takes a lot of time and a big strain on our computers, we will stop here and use the LinearSVC Classifier."
]
},
{
"cell_type": "markdown",
"id": "12b5283d",
"metadata": {},
"source": [
"## Model Selection\n",
"\n",
"As a game can have multiple genres, our Model(s) has to be capable of multi-label-classification. sklearn's ``MultiOutputClassifier`` can do this. As a backend for ``MultiOutputClassifier`` we use ``LinearSVC``"
]
},
{
"cell_type": "code",
"execution_count": 13,
"id": "8c1d72c4532bd509",
"metadata": {},
"outputs": [],
"source": [
"from sklearn.svm import LinearSVC\n",
"from sklearn.multioutput import MultiOutputClassifier\n",
"\n",
"multi_target_clf = MultiOutputClassifier(LinearSVC(max_iter=1337, random_state=0), n_jobs=1)\n",
"\n",
"multi_target_clf.fit(X_train, y_train)\n",
"\n",
"y_pred = multi_target_clf.predict(X_test)"
]
},
{
"cell_type": "markdown",
"id": "0faa9856",
"metadata": {},
"source": [
"# Evaluation\n",
"We evaluate our model by comparing the test data with the predicted data. We are using the worst case scenario by setting zero_division=0.0 in the classification report. This means that if a metric cannot be calculated due to division by zero, it is set to 0.0. Setting this parameter to 1.0 (best case) does not significantly change the results.\n",
"\n",
"Our approach involves training one model per genre, resulting in a total of 12 models. Each model predicts a specific genre, and the combined results of all models are shown at the bottom of the report. The input features are represented by x, and the output labels by y.\n",
"\n",
"Key metrics such as precision and recall are calculated for each class. These metrics indicate whether all classes are recognized and how accurate the predictions are. Notably, only one class achieves perfect 1.0 precision. For some reason, the Early Access class performs particularly poorly. The F1 score is also included in the evaluation, as it provides a balanced measure of precision and recall. The support column indicates the number of samples for each class.\n",
"\n",
"\n",
"It is noteworthy that some of the top 10 words influencing the decision process are related to brands, such as \"ea\" in Sports, even though we removed the developer and publisher columns. Some words, like \"brokkoli\" in Racing, are not obviously related to the genre, which may indicate slight overfitting or the presence of only a few relevant but fitting data points in the dataset.\n",
"\n",
"Generally, a model is considered very good with an F1 score above 0.8, and good with a score above 0.7. In our case, the F1 scores are 0.69 and 0.54, which means our model performs moderately well up to good. The low macro and micro scores are mainly due to problematic classes, but overall, the weighted average and samples average are quite acceptable.\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 14,
"id": "e2ebea6945193e07",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" precision recall f1-score support\n",
"\n",
" Action 0.86 0.87 0.87 300\n",
" Adventure 0.74 0.66 0.70 216\n",
" Casual 0.79 0.22 0.35 86\n",
" Early Access 0.50 0.02 0.04 46\n",
" Free To Play 0.79 0.28 0.41 83\n",
" Indie 0.77 0.81 0.79 245\n",
"Massively Multiplayer 0.89 0.19 0.31 42\n",
" RPG 0.80 0.55 0.65 127\n",
" Racing 1.00 0.58 0.74 12\n",
" Simulation 0.86 0.50 0.64 127\n",
" Sports 1.00 0.29 0.44 14\n",
" Strategy 0.80 0.41 0.54 106\n",
"\n",
" micro avg 0.81 0.60 0.69 1404\n",
" macro avg 0.82 0.45 0.54 1404\n",
" weighted avg 0.80 0.60 0.65 1404\n",
" samples avg 0.81 0.66 0.69 1404\n",
"\n",
"Most important words of class 'Action':\n",
"['action', 'weapons', 'shooter', 'fighting', 'fight', 'weapon', 'players', 'aim', 'gun', 'intense']\n",
"\n",
"Most important words of class 'Adventure':\n",
"['adventure', 'explore', 'puzzles', 'smite', 'far', 'stories', 'remake', 'hunting', 'don', 'secrets']\n",
"\n",
"Most important words of class 'Casual':\n",
"['puzzle', 'color', 'ball', 'smite', 'poker', 'click', 'communication', 'idle', 'cats', 'fun']\n",
"\n",
"Most important words of class 'Early Access':\n",
"['early', 'pals', 'backrooms', 'automation', 'rotwood', 'access', 'design', 'vrchat', 'nephelym', 'idleon']\n",
"\n",
"Most important words of class 'Free To Play':\n",
"['free', 'royale', 'mmo', 'pvp', 'arena', 'mmorpg', 'idle', 'cats', 'millions', 'team']\n",
"\n",
"Most important words of class 'Indie':\n",
"['game', 'horror', 'building', 'different', 'vermintide', 'generated', 'roguelike', 'better', 'soundtrack', 'procedurally']\n",
"\n",
"Most important words of class 'Massively Multiplayer':\n",
"['royale', 'mmorpg', 'players', 'mmo', 'pvp', 'ball', 'smite', 'scp', 'temtem', 'join']\n",
"\n",
"Most important words of class 'RPG':\n",
"['rpg', 'loot', 'dungeons', 'combat', 'dungeon', 'character', 'fantasy', 'quests', 'skills', '觅长生']\n",
"\n",
"Most important words of class 'Racing':\n",
"['cars', 'racing', 'car', 'race', 'speed', 'driving', 'brokkoli', 'ddnet', 'rally', 'jeff']\n",
"\n",
"Most important words of class 'Simulation':\n",
"['simulator', 'realistic', 'simulation', 'physics', 'sandbox', 'building', 'workshop', 'management', 'car', 'idle']\n",
"\n",
"Most important words of class 'Sports':\n",
"['racing', 'skate', 'sports', 'football', 'rally', 'virtual', 'ea', 'vrchat', 'hunting', 'realistic']\n",
"\n",
"Most important words of class 'Strategy':\n",
"['strategy', 'turn', 'units', 'buildings', 'strategic', 'heroes', 'tactical', 'command', '觅长生', 'squad']\n",
"\n"
]
}
],
"source": [
"from sklearn.metrics import classification_report\n",
"\n",
"print(classification_report(y_test, y_pred, target_names=y_test.columns, zero_division=0.0))\n",
"\n",
"feature_names = vectorizer.get_feature_names_out()\n",
"class_names = y_test.columns\n",
"\n",
"for i, class_name in enumerate(class_names):\n",
" coef = multi_target_clf.estimators_[i].coef_.flatten()\n",
" # print the top 10 coefficients used\n",
" top10 = np.argsort(coef)[-10:]\n",
" print(f\"Most important words of class '{class_name}':\")\n",
" print([feature_names[j] for j in top10][::-1]) \n",
" print()"
]
},
{
"cell_type": "markdown",
"id": "2aeb6fc2",
"metadata": {},
"source": [
"# Optimization\n",
"- Since our dataset contains multiple languages, it would be beneficial to either train a separate model for each language or to standardize the data before and remove the stop words specific to each language.\n",
"\n",
"- Hyperparameter validation should also be performed. For example, in LinearSVC, the C parameter controls the learning rate and could be further optimized.\n",
"\n",
"- Instead of a simple train-test split, k-fold cross validation should be used to achieve better data mixing and more robust results.\n",
"\n",
"- Additionally, ensemble learning methods could be considered to further improve performance.\n",
"\n",
"The biggest limitation of our dataset is the presence of too many languages but too few entries for each, which is also constrained by our computing resources."
]
},
{
"cell_type": "markdown",
"id": "3b709fb7",
"metadata": {},
"source": [
"# Conclusion and outlook\n",
"To conclude we can say that our model performs reasonably well for the intended application. With a larger dataset, the results would likely improve further. Considering the points mentioned above, it is quite impressive that the model achieves these results using only a small dataset and limited computational resources.\n",
"\n",
"Our collaboration as a team worked very smoothly throughout the project. Communication and planning were effective, allowing us to coordinate our tasks efficiently and make steady progress.\n",
"\n",
"The main challenge we faced was the limited computational resources available to us. Especially when working with the 10k dataset, training the models for statistical evaluation took a considerable amount of time. To address this, each team member ran different models in parallel on their own machines, with some training processes running for several days.\n",
"\n",
"Due to these computational constraints, we decided not to process the full dataset with 80,000 entries. Even though we had access to very powerful PCs equipped with the latest high-end components, the training times were still prohibitively long. As a result, we focused our efforts on the smaller datasets to ensure we could complete the project within a reasonable timeframe.\n",
"\n",
"In summary, this project provided us with valuable insights into the challenges and opportunities of machine learning in a real-world context. Despite the limitations we faced, we were able to develop a functioning model and gain practical experience in data preprocessing, model selection, and evaluation. We are proud of what we achieved as a team and look forward to applying the knowledge and skills gained here to future projects.\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "base",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.13.5"
}
},
"nbformat": 4,
"nbformat_minor": 5
}