Add Evaluation, Optimization and Conclusion.

2025-08-25 21:41:23 +02:00
parent 5ad3bbf435
commit c3c4ebc9a7
1 changed files with 109 additions and 41 deletions
--- a/notebook.ipynb
+++ b/notebook.ipynb
@@ -293,12 +293,12 @@
     "name": "stdout",
     "output_type": "stream",
     "text": [
-      "    00  000  000km    000th  00am  00f  00i  00p  00v   01  ...  이터널  이터널리턴  \\\n",
+      "    00  000  000km     000th  00am  00f  00i  00p  00v   01  ...  이터널  이터널리턴  \\\n",
-      "0  0.0  0.0    0.0  0.00000   0.0  0.0  0.0  0.0  0.0  0.0  ...  0.0    0.0   \n",
+      "0  0.0  0.0    0.0  0.000000   0.0  0.0  0.0  0.0  0.0  0.0  ...  0.0    0.0   \n",
-      "1  0.0  0.0    0.0  0.00000   0.0  0.0  0.0  0.0  0.0  0.0  ...  0.0    0.0   \n",
+      "1  0.0  0.0    0.0  0.000000   0.0  0.0  0.0  0.0  0.0  0.0  ...  0.0    0.0   \n",
-      "2  0.0  0.0    0.0  0.14649   0.0  0.0  0.0  0.0  0.0  0.0  ...  0.0    0.0   \n",
+      "2  0.0  0.0    0.0  0.162349   0.0  0.0  0.0  0.0  0.0  0.0  ...  0.0    0.0   \n",
-      "3  0.0  0.0    0.0  0.00000   0.0  0.0  0.0  0.0  0.0  0.0  ...  0.0    0.0   \n",
+      "3  0.0  0.0    0.0  0.000000   0.0  0.0  0.0  0.0  0.0  0.0  ...  0.0    0.0   \n",
-      "4  0.0  0.0    0.0  0.00000   0.0  0.0  0.0  0.0  0.0  0.0  ...  0.0    0.0   \n",
+      "4  0.0  0.0    0.0  0.000000   0.0  0.0  0.0  0.0  0.0  0.0  ...  0.0    0.0   \n",
      "\n",
      "   이현준  정대찬  중입니다   철권  토탈워  페르소나  한국어  한글을  \n",
      "0  0.0  0.0   0.0  0.0  0.0   0.0  0.0  0.0  \n",
@@ -307,14 +307,14 @@
      "3  0.0  0.0   0.0  0.0  0.0   0.0  0.0  0.0  \n",
      "4  0.0  0.0   0.0  0.0  0.0   0.0  0.0  0.0  \n",
      "\n",
-      "[5 rows x 29351 columns]\n"
+      "[5 rows x 29056 columns]\n"
     ]
    }
   ],
   "source": [
    "from sklearn.feature_extraction.text import TfidfVectorizer\n",
    "\n",
-    "vectorizer = TfidfVectorizer()\n",
+    "vectorizer = TfidfVectorizer(stop_words='english')\n",
    "tfidf_matrix = vectorizer.fit_transform(dataset['desc']) # matrix, not pandas df\n",
    "tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), columns=vectorizer.get_feature_names_out())\n",
    "print(tfidf_df.head())"
@@ -448,10 +448,10 @@
    {
     "data": {
      "text/plain": [
-       "99"
+       "234"
      ]
     },
-     "execution_count": null,
+     "execution_count": 43,
     "metadata": {},
     "output_type": "execute_result"
    }
@@ -470,7 +470,6 @@
    "\n",
    "# preparation of X\n",
    "del tfidf_df\n",
    "del vectorizer\n",
    "del tfidf_matrix\n",
    "\n",
    "# Initial Dataset\n",
@@ -560,13 +559,7 @@
   "execution_count": null,
   "id": "8c1d72c4532bd509",
   "metadata": {},
-   "outputs": [
+   "outputs": [],
    {
     "name": "stderr",
     "output_type": "stream",
     "text": []
    }
   ],
   "source": [
    "from sklearn.svm import LinearSVC\n",
    "from sklearn.multioutput import MultiOutputClassifier\n",
@@ -584,7 +577,17 @@
   "metadata": {},
   "source": [
    "# Evaluation\n",
-    "**TODO Test the Model with the test data**"
+    "We evaluate the model by comparing its predictions with the labels of the test data. In the classification report we set zero_division=0.0, which is the worst-case choice: whenever a metric cannot be computed because of a division by zero, it is reported as 0.0. Setting this parameter to 1.0 (best case) does not significantly change the results.\n",
    "\n",
    "Our approach involves training one model per genre, resulting in a total of 12 models. Each model predicts a specific genre, and the combined results of all models are shown at the bottom of the report. The input features are represented by x, and the output labels by y.\n",
    "\n",
    "Key metrics such as precision and recall are calculated for each class. These metrics indicate whether all classes are recognized and how accurate the predictions are. Notably, only one class achieves perfect 1.0 precision. For some reason, the Early Access class performs particularly poorly. The F1 score is also included in the evaluation, as it provides a balanced measure of precision and recall. The support column indicates the number of samples for each class.\n",
    "\n",
    "\n",
    "It is noteworthy that some of the top 10 words influencing the decision process are related to brands, such as \"ea\" in Sports, even though we removed the developer and publisher columns. Some words, like \"brokkoli\" in Racing, are not obviously related to the genre, which may indicate slight overfitting or the presence of only a few relevant but fitting data points in the dataset.\n",
    "\n",
    "Generally, a model is considered very good with an F1 score above 0.8, and good with a score above 0.7. In our case, the F1 scores are 0.69 and 0.54, which means our model performs moderately well up to good. The low macro and micro scores are mainly due to problematic classes, but overall, the weighted average and samples average are quite acceptable.\n",
    "\n"
   ]
  },
  {
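To illustrate why a macro average can sit well below the micro average, and what `zero_division` does, here is a small sketch on made-up multilabel data (three hypothetical labels, not the project's test set). One frequent label is predicted well, one is partly missed, and one rare label is never predicted.

```python
import numpy as np
from sklearn.metrics import f1_score, precision_score

# Made-up multilabel indicator matrices: 6 samples, 3 labels.
y_true = np.array([
    [1, 0, 0],
    [1, 0, 0],
    [1, 1, 0],
    [1, 0, 0],
    [0, 1, 0],
    [1, 0, 1],
])
# Label 0 is always right, label 1 is missed once, label 2 is never predicted.
y_pred = np.array([
    [1, 0, 0],
    [1, 0, 0],
    [1, 0, 0],
    [1, 0, 0],
    [0, 1, 0],
    [1, 0, 0],
])

# Per-label F1, then the two aggregate views.
per_label = f1_score(y_true, y_pred, average=None, zero_division=0.0)
macro = f1_score(y_true, y_pred, average="macro", zero_division=0.0)  # mean over labels
micro = f1_score(y_true, y_pred, average="micro", zero_division=0.0)  # pooled counts

# Label 2 has no positive predictions, so its precision is 0/0
# and takes whatever value zero_division specifies.
worst = precision_score(y_true, y_pred, average=None, zero_division=0.0)[2]
best = precision_score(y_true, y_pred, average=None, zero_division=1.0)[2]

print("per-label F1:", per_label)
print("macro:", macro, "micro:", micro)
print("precision of never-predicted label:", worst, "vs", best)
```

The macro average weights every label equally, so a single rare label with an F1 of zero pulls it down, whereas the micro average pools all decisions and is dominated by the frequent labels.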
@@ -597,25 +600,61 @@
     "name": "stdout",
     "output_type": "stream",
     "text": [
-      "              precision    recall  f1-score   support\n",
+      "                       precision    recall  f1-score   support\n",
      "\n",
-      "           0       0.84      0.86      0.85       300\n",
+      "               Action       0.86      0.87      0.87       300\n",
-      "           1       0.74      0.63      0.68       216\n",
+      "            Adventure       0.74      0.66      0.70       216\n",
-      "           2       0.77      0.31      0.45        86\n",
+      "               Casual       0.79      0.22      0.35        86\n",
-      "           3       0.50      0.04      0.08        46\n",
+      "         Early Access       0.50      0.02      0.04        46\n",
-      "           4       0.69      0.33      0.44        83\n",
+      "         Free To Play       0.79      0.28      0.41        83\n",
-      "           5       0.79      0.80      0.79       245\n",
+      "                Indie       0.77      0.81      0.79       245\n",
-      "           6       0.69      0.26      0.38        42\n",
+      "Massively Multiplayer       0.89      0.19      0.31        42\n",
-      "           7       0.74      0.62      0.68       127\n",
+      "                  RPG       0.80      0.55      0.65       127\n",
-      "           8       1.00      0.67      0.80        12\n",
+      "               Racing       1.00      0.58      0.74        12\n",
-      "           9       0.80      0.57      0.67       127\n",
+      "           Simulation       0.86      0.50      0.64       127\n",
-      "          10       1.00      0.50      0.67        14\n",
+      "               Sports       1.00      0.29      0.44        14\n",
-      "          11       0.79      0.46      0.58       106\n",
+      "             Strategy       0.80      0.41      0.54       106\n",
      "\n",
-      "   micro avg       0.79      0.62      0.69      1404\n",
+      "            micro avg       0.81      0.60      0.69      1404\n",
-      "   macro avg       0.78      0.51      0.59      1404\n",
+      "            macro avg       0.82      0.45      0.54      1404\n",
-      "weighted avg       0.77      0.62      0.67      1404\n",
+      "         weighted avg       0.80      0.60      0.65      1404\n",
-      " samples avg       0.80      0.68      0.70      1404\n",
+      "          samples avg       0.81      0.66      0.69      1404\n",
      "\n",
      "Most important words of class 'Action':\n",
      "['action', 'weapons', 'shooter', 'fighting', 'fight', 'weapon', 'players', 'aim', 'gun', 'intense']\n",
      "\n",
      "Most important words of class 'Adventure':\n",
      "['adventure', 'explore', 'puzzles', 'smite', 'far', 'stories', 'remake', 'hunting', 'don', 'secrets']\n",
      "\n",
      "Most important words of class 'Casual':\n",
      "['puzzle', 'color', 'ball', 'smite', 'poker', 'click', 'communication', 'idle', 'cats', 'fun']\n",
      "\n",
      "Most important words of class 'Early Access':\n",
      "['early', 'pals', 'backrooms', 'automation', 'rotwood', 'access', 'design', 'vrchat', 'nephelym', 'idleon']\n",
      "\n",
      "Most important words of class 'Free To Play':\n",
      "['free', 'royale', 'mmo', 'pvp', 'arena', 'mmorpg', 'idle', 'cats', 'millions', 'team']\n",
      "\n",
      "Most important words of class 'Indie':\n",
      "['game', 'horror', 'building', 'different', 'vermintide', 'generated', 'roguelike', 'better', 'soundtrack', 'procedurally']\n",
      "\n",
      "Most important words of class 'Massively Multiplayer':\n",
      "['royale', 'mmorpg', 'players', 'mmo', 'pvp', 'ball', 'smite', 'scp', 'temtem', 'join']\n",
      "\n",
      "Most important words of class 'RPG':\n",
      "['rpg', 'loot', 'dungeons', 'combat', 'dungeon', 'character', 'fantasy', 'quests', 'skills', '觅长生']\n",
      "\n",
      "Most important words of class 'Racing':\n",
      "['cars', 'racing', 'car', 'race', 'speed', 'driving', 'brokkoli', 'ddnet', 'rally', 'jeff']\n",
      "\n",
      "Most important words of class 'Simulation':\n",
      "['simulator', 'realistic', 'simulation', 'physics', 'sandbox', 'building', 'workshop', 'management', 'car', 'idle']\n",
      "\n",
      "Most important words of class 'Sports':\n",
      "['racing', 'skate', 'sports', 'football', 'rally', 'virtual', 'ea', 'vrchat', 'hunting', 'realistic']\n",
      "\n",
      "Most important words of class 'Strategy':\n",
      "['strategy', 'turn', 'units', 'buildings', 'strategic', 'heroes', 'tactical', 'command', '觅长生', 'squad']\n",
      "\n"
     ]
    }
@@ -623,7 +662,18 @@
   "source": [
    "from sklearn.metrics import classification_report\n",
    "\n",
-    "print(classification_report(y_test, y_pred, zero_division=0.0))"
+    "print(classification_report(y_test, y_pred, target_names=y_test.columns, zero_division=0.0))\n",
    "\n",
    "feature_names = vectorizer.get_feature_names_out()\n",
    "class_names = y_test.columns\n",
    "\n",
    "for i, class_name in enumerate(class_names):\n",
    "    coef = multi_target_clf.estimators_[i].coef_.flatten()\n",
    "    # print the top 10 coefficients used\n",
    "    top10 = np.argsort(coef)[-10:]\n",
    "    print(f\"Most important words of class '{class_name}':\")\n",
    "    print([feature_names[j] for j in top10][::-1]) \n",
    "    print()"
   ]
  },
  {
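The coefficient ranking in the cell above can be isolated into a small helper. This is only a sketch with invented feature names and coefficients; `top_k_features` is a hypothetical name and not part of scikit-learn.

```python
import numpy as np

def top_k_features(coef, feature_names, k=10):
    """Return the k feature names with the largest coefficients, largest first."""
    coef = np.asarray(coef).ravel()        # LinearSVC.coef_ has shape (1, n_features)
    order = np.argsort(coef)[::-1][:k]     # descending order, keep the first k
    return [str(feature_names[j]) for j in order]

# Invented example values, not taken from the trained models.
feature_names = np.array(["car", "dragon", "race", "the", "speed"])
coef = np.array([[0.9, -0.4, 1.3, 0.0, 0.7]])

print(top_k_features(coef, feature_names, k=3))
```

Only the largest positive weights are listed here; strongly negative weights, which argue against a class, would need a separate ranking of the other end of the sorted array.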
@@ -632,7 +682,15 @@
   "metadata": {},
   "source": [
    "# Optimization\n",
-    "**TODO optimize the model based on the test results**"
+    "- Since our dataset contains multiple languages, it would be beneficial either to train a separate model for each language or to normalize the data beforehand and remove the stop words of each language. Currently only English stop words are removed.\n",
    "\n",
    "- Hyperparameter validation should also be performed. For example, in LinearSVC, the C parameter controls the learning rate and could be further optimized.\n",
    "\n",
    "- Instead of a simple train-test split, k-fold cross validation should be used to achieve better data mixing and more robust results.\n",
    "\n",
    "- Additionally, ensemble learning methods could be considered to further improve performance.\n",
    "\n",
    "The biggest limitation of our dataset is the presence of too many languages but too few entries for each, which is also constrained by our computing resources."
   ]
  },
  {
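The hyperparameter and k-fold suggestions in the Optimization list could be combined roughly as follows. This is a sketch under assumptions: synthetic data from `make_multilabel_classification` stands in for the TF-IDF features and genre labels, and the grid values for `C` are arbitrary starting points.

```python
from sklearn.datasets import make_multilabel_classification
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.multioutput import MultiOutputClassifier
from sklearn.svm import LinearSVC

# Synthetic stand-in for the TF-IDF matrix and the genre labels (assumption).
X, y = make_multilabel_classification(
    n_samples=200, n_features=30, n_classes=4, random_state=0
)

# One LinearSVC per label, mirroring the one-model-per-genre setup.
clf = MultiOutputClassifier(LinearSVC(max_iter=5000))

# C is the inverse regularization strength: smaller C regularizes more strongly.
# The "estimator__" prefix routes the parameter to the wrapped LinearSVC.
param_grid = {"estimator__C": [0.01, 0.1, 1.0, 10.0]}

search = GridSearchCV(
    clf,
    param_grid,
    scoring="f1_micro",
    cv=KFold(n_splits=5, shuffle=True, random_state=0),  # 5-fold cross-validation
)
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))
```

On the real data the vectorizer would ideally sit in a `Pipeline` together with the classifier, so that the TF-IDF vocabulary is fitted on the training folds only.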
@@ -641,13 +699,23 @@
   "metadata": {},
   "source": [
    "# Conclusion and outlook\n",
-    "**TODO Write a conclusion and outlook what can be done and where the issues were.**"
+    "Create the submission containing the GitHub link. Link to the dataset and the notebook. Put all of this in the README. Mention Git LFS so that the reader can download it. And create a PDF.\n",
    "\n",
    "To conclude we can say that our model performs reasonably well for the intended application. With a larger dataset, the results would likely improve further. Considering the points mentioned above, it is quite impressive that the model achieves these results using only a small dataset and limited computational resources.\n",
    "\n",
    "Our collaboration as a team worked very smoothly throughout the project. Communication and planning were effective, allowing us to coordinate our tasks efficiently and make steady progress.\n",
    "\n",
    "The main challenge we faced was the limited computational resources available to us. Especially when working with the 10k dataset, training the models for statistical evaluation took a considerable amount of time. To address this, each team member ran different models in parallel on their own machines, with some training processes running for several days.\n",
    "\n",
    "Due to these computational constraints, we decided not to process the full dataset with 80,000 entries. Even though we had access to very powerful PCs equipped with the latest high-end components, the training times were still prohibitively long. As a result, we focused our efforts on the smaller datasets to ensure we could complete the project within a reasonable timeframe.\n",
    "\n",
    "In summary, this project provided us with valuable insights into the challenges and opportunities of machine learning in a real-world context. Despite the limitations we faced, we were able to develop a functioning model and gain practical experience in data preprocessing, model selection, and evaluation. We are proud of what we achieved as a team and look forward to applying the knowledge and skills gained here to future projects.\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
-   "display_name": "Python 3",
+   "display_name": "base",
   "language": "python",
   "name": "python3"
  },
@@ -661,7 +729,7 @@
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
-   "version": "3.13.3"
+   "version": "3.13.5"
  }
 },
 "nbformat": 4,