Group TN01 - Team SEML31
Colab Notebook: Assignment 2


We applied the following pipeline to clean the raw text:
Due to the dataset size, we utilized a stratified subset of 100,000 samples for training and testing. We employed two distinct embedding strategies:
Feature Configurations:
| ID | Feature Type | Dimensions | Preprocessing |
|---|---|---|---|
| 1 | TF-IDF | 2500 | No stop words |
| 2 | TF-IDF | 5000 | No stop words |
| 3 | TF-IDF + SVD | 2500 -> 300 | No stop words |
| 4 | TF-IDF + SVD | 5000 -> 300 | No stop words |
| 5 | BERT | 768 | Keep stop words |
We evaluated 30 models across three categories using Accuracy, Recall, Precision, and F1-score.
| classifier | model_config | preprocessing | accuracy | precision | recall | f1 |
|---|---|---|---|---|---|---|
| Logistic Regression | 0.5 | tfidf_2500 | 0.7721 | 0.7911 | 0.7635 | 0.7771 |
| Logistic Regression | 0.5 | tfidf_5000 | 0.7760 | 0.7941 | 0.7678 | 0.7807 |
| Logistic Regression | 0.5 | bert | 0.7834 | 0.7866 | 0.7831 | 0.7848 |
| Logistic Regression | 1.0 | tfidf_2500 | 0.7728 | 0.7925 | 0.7639 | 0.7779 |
| Logistic Regression | 1.0 | tfidf_5000 | 0.7778 | 0.7957 | 0.7695 | 0.7824 |
| Logistic Regression | 1.0 | bert | 0.7831 | 0.7865 | 0.7826 | 0.7846 |
| Logistic Regression | 2.0 | tfidf_2500 | 0.7720 | 0.7929 | 0.7624 | 0.7773 |
| Logistic Regression | 2.0 | tfidf_5000 | 0.7782 | 0.7962 | 0.7698 | 0.7828 |
| Logistic Regression | 2.0 | bert | 0.7831 | 0.7859 | 0.7829 | 0.7844 |
| classifier | model_config | preprocessing | accuracy | precision | recall | f1 |
|---|---|---|---|---|---|---|
| Linear SVC | 0.5 | tfidf_2500 | 0.7702 | 0.7952 | 0.7587 | 0.7765 |
| Linear SVC | 0.5 | tfidf_5000 | 0.7750 | 0.7962 | 0.7652 | 0.7804 |
| Linear SVC | 0.5 | bert | 0.7839 | 0.7883 | 0.7828 | 0.7856 |
| Linear SVC | 1.0 | tfidf_2500 | 0.7692 | 0.7943 | 0.7577 | 0.7755 |
| Linear SVC | 1.0 | tfidf_5000 | 0.7724 | 0.7931 | 0.7630 | 0.7778 |
| Linear SVC | 1.0 | bert | 0.7839 | 0.7883 | 0.7828 | 0.7856 |
| Linear SVC | 2.0 | tfidf_2500 | 0.7690 | 0.7941 | 0.7576 | 0.7754 |
| Linear SVC | 2.0 | tfidf_5000 | 0.7710 | 0.7920 | 0.7615 | 0.7764 |
| Linear SVC | 2.0 | bert | 0.7839 | 0.7884 | 0.7828 | 0.7856 |
| classifier | model_config | preprocessing | accuracy | precision | recall | f1 |
|---|---|---|---|---|---|---|
| Random Forest | 50.0 | svd_2500 | 0.7056 | 0.6891 | 0.7145 | 0.7016 |
| Random Forest | 50.0 | svd_5000 | 0.7087 | 0.6867 | 0.7202 | 0.7030 |
| Random Forest | 100.0 | svd_2500 | 0.7160 | 0.7071 | 0.7216 | 0.7143 |
| Random Forest | 100.0 | svd_5000 | 0.7191 | 0.7026 | 0.7284 | 0.7152 |
| classifier | model_config | preprocessing | accuracy | precision | recall | f1 |
|---|---|---|---|---|---|---|
| XGBoost | 50.0 | svd_2500 | 0.7242 | 0.7430 | 0.7177 | 0.7301 |
| XGBoost | 50.0 | svd_5000 | 0.7282 | 0.7383 | 0.7253 | 0.7318 |
| XGBoost | 50.0 | bert | 0.7563 | 0.7579 | 0.7570 | 0.7575 |
| XGBoost | 100.0 | svd_2500 | 0.7260 | 0.7435 | 0.7200 | 0.7315 |
| XGBoost | 100.0 | svd_5000 | 0.7314 | 0.7432 | 0.7277 | 0.7354 |
| XGBoost | 100.0 | bert | 0.7621 | 0.7621 | 0.7636 | 0.7629 |
| classifier | model_config | preprocessing | accuracy | precision | recall | f1 |
|---|---|---|---|---|---|---|
| MLP | Dropout + Scheduler + Early Stopping | bert | 0.7909 | 0.7738 | 0.8026 | 0.7879 |
| MLP | Dropout + Scheduler + Early Stopping | svd | 0.7437 | 0.7524 | 0.7411 | 0.7467 |
| classifier | model_config | preprocessing | accuracy | precision | recall | f1 |
|---|---|---|---|---|---|---|
| Logistic Regression | 0.5 | bert | 0.7834 | 0.7866 | 0.7831 | 0.7848 |
| Linear SVC | 2.0 | bert | 0.7839 | 0.7884 | 0.7828 | 0.7856 |
| Random Forest | 100.0 | svd_5000 | 0.7191 | 0.7026 | 0.7284 | 0.7152 |
| XGBoost | 100.0 | bert | 0.7621 | 0.7621 | 0.7636 | 0.7629 |
| MLP | Dropout + Scheduler + Early Stopping | bert | 0.7909 | 0.7738 | 0.8026 | 0.7879 |
BERT Supremacy: BERT embeddings consistently outperformed TF-IDF across almost all classifiers. This confirms that capturing context and semantic meaning (bidirectional) is superior to simple keyword frequency (TF-IDF) for sentiment analysis, especially in short, informal texts like tweets.
Tree-based Model Limitations: Random Forest and XGBoost showed only marginal improvements when increasing estimators from 50 to 100 (~1% gain). Given the high computational cost, the return on investment for scaling up tree models on this specific task is low compared to using a better embedding (BERT).