SEML31

Assignment 2: Text Data

Group TN01 - Team SEML31

Colab Notebook: Assignment 2

1. Overview

2. Exploratory Data Analysis (EDA)

3. Preprocessing

We applied the following pipeline to clean the raw text:

  1. Lowercase: Normalize all text to lowercase.
  2. Noise Removal: Remove URLs, usernames (e.g., @user), and special characters/punctuation.
  3. Stop Word Removal: Applied only for TF-IDF models to reduce noise.
  4. Lemmatization: Convert words to their base forms (e.g., learning $\to$ learn) to consolidate vocabulary.
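The four steps above can be sketched with the standard library alone. This is a minimal illustration, not the notebook's actual code: the `STOP_WORDS` set is a tiny hypothetical list, the regular expressions are assumptions about what counts as noise, and `lemmatize` is a pluggable callable standing in for whichever lemmatizer the team used.

```python
import re

# Hypothetical stop-word list for illustration only; the report does not name the list used.
STOP_WORDS = {"a", "an", "the", "is", "am", "are", "at", "to", "and", "of", "in", "i", "it"}

URL_RE = re.compile(r"https?://\S+|www\.\S+")   # assumed URL pattern
USER_RE = re.compile(r"@\w+")                    # @user mentions
NON_ALNUM_RE = re.compile(r"[^a-z0-9\s]")        # punctuation and special characters


def clean_text(text, remove_stop_words=False, lemmatize=None):
    """Apply the pipeline in order; `lemmatize` maps a word to its base form."""
    text = text.lower()                       # 1. lowercase
    text = URL_RE.sub(" ", text)              # 2. noise removal: URLs
    text = USER_RE.sub(" ", text)             #    usernames
    text = NON_ALNUM_RE.sub(" ", text)        #    special characters/punctuation
    tokens = text.split()
    if remove_stop_words:                     # 3. only for the TF-IDF branch
        tokens = [t for t in tokens if t not in STOP_WORDS]
    if lemmatize is not None:                 # 4. plug in a real lemmatizer here
        tokens = [lemmatize(t) for t in tokens]
    return " ".join(tokens)


# Toy lemmatizer that only knows the example from the list above.
toy_lemmas = {"learning": "learn"}
tweet = "@user I am LEARNING NLP at https://t.co/abc !!!"
print(clean_text(tweet, remove_stop_words=True,
                 lemmatize=lambda w: toy_lemmas.get(w, w)))  # -> learn nlp
```

Keeping stop-word removal behind a flag mirrors the report's choice to apply it for TF-IDF only, so the same function can serve both feature branches.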

4. Feature Extraction

Due to the dataset size, we utilized a stratified subset of 100,000 samples for training and testing. We employed two distinct embedding strategies:

  1. TF-IDF (Bag of Words): Weights each term by how often it appears in a document, discounted by how many documents contain it. We removed stop words to prioritize meaningful content.
    • Variants: Max features (2500 vs 5000) and Dimensionality Reduction (SVD 300).
  2. BERT (Contextual Embeddings): A pre-trained transformer model. We kept stop words because BERT relies on sentence structure and context to generate accurate embeddings (768 dimensions).
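To make the TF-IDF weighting concrete, here is a small pure-Python sketch of the textbook formula tf × log(N / df). It is an illustration under assumptions, not the notebook's implementation: the report does not say which library was used, and a library such as scikit-learn's `TfidfVectorizer` applies smoothed IDF and L2 normalization by default, so its numbers would differ.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: the number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


docs = ["good movie", "bad movie", "good good day"]
for doc, w in zip(docs, tfidf(docs)):
    print(doc, {term: round(value, 3) for term, value in w.items()})
```

In the second document, "bad" appears in only one of the three documents and therefore receives a higher weight than "movie", which appears in two. A term present in every document gets weight zero.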

Feature Configurations:

ID  Feature Type   Dimensions    Preprocessing
1   TF-IDF         2500          No stop words
2   TF-IDF         5000          No stop words
3   TF-IDF + SVD   2500 -> 300   No stop words
4   TF-IDF + SVD   5000 -> 300   No stop words
5   BERT           768           Keep stop words
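The stratified 100,000-sample subset mentioned at the start of this section could be drawn along the following lines. This is a sketch under assumptions: the function name `stratified_sample`, the fixed seed, and the rounding rule are illustrative choices, not details taken from the notebook.

```python
import random
from collections import defaultdict


def stratified_sample(labels, n_samples, seed=0):
    """Return sorted indices of a subset whose class proportions match `labels`."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    total = len(labels)
    chosen = []
    for indices in by_class.values():
        # Each class receives a share proportional to its frequency (rounded).
        k = round(n_samples * len(indices) / total)
        chosen.extend(rng.sample(indices, k))
    return sorted(chosen)


labels = [0] * 600 + [1] * 400
subset = stratified_sample(labels, 100)
print(len(subset), sum(labels[i] for i in subset))
```

Because each class share is rounded independently, the subset size can differ from `n_samples` by a few items when the proportions do not divide evenly.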

5. Modeling Strategy

We evaluated 30 models across three categories using Accuracy, Recall, Precision, and F1-score.
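For reference, the four metrics can be computed from prediction counts as sketched below. The sketch assumes binary sentiment labels with 1 as the positive class, which the report does not state explicitly; the F1 values in the result tables are consistent with the harmonic mean of the reported precision and recall.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0   # share of predicted positives that are correct
    recall = tp / (tp + fn) if tp + fn else 0.0      # share of actual positives that are found
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)            # harmonic mean of precision and recall
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
print(binary_metrics(y_true, y_pred))
```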

A. Linear Models

B. Tree-based Models

C. Neural Networks (MLP)

6. Results & Discussion

Logistic Regression

classifier           C    features    accuracy  precision  recall  f1
Logistic Regression  0.5  tfidf_2500  0.7721    0.7911     0.7635  0.7771
Logistic Regression  0.5  tfidf_5000  0.7760    0.7941     0.7678  0.7807
Logistic Regression  0.5  bert        0.7834    0.7866     0.7831  0.7848
Logistic Regression  1.0  tfidf_2500  0.7728    0.7925     0.7639  0.7779
Logistic Regression  1.0  tfidf_5000  0.7778    0.7957     0.7695  0.7824
Logistic Regression  1.0  bert        0.7831    0.7865     0.7826  0.7846
Logistic Regression  2.0  tfidf_2500  0.7720    0.7929     0.7624  0.7773
Logistic Regression  2.0  tfidf_5000  0.7782    0.7962     0.7698  0.7828
Logistic Regression  2.0  bert        0.7831    0.7859     0.7829  0.7844

Linear SVC

classifier  C    features    accuracy  precision  recall  f1
Linear SVC  0.5  tfidf_2500  0.7702    0.7952     0.7587  0.7765
Linear SVC  0.5  tfidf_5000  0.7750    0.7962     0.7652  0.7804
Linear SVC  0.5  bert        0.7839    0.7883     0.7828  0.7856
Linear SVC  1.0  tfidf_2500  0.7692    0.7943     0.7577  0.7755
Linear SVC  1.0  tfidf_5000  0.7724    0.7931     0.7630  0.7778
Linear SVC  1.0  bert        0.7839    0.7883     0.7828  0.7856
Linear SVC  2.0  tfidf_2500  0.7690    0.7941     0.7576  0.7754
Linear SVC  2.0  tfidf_5000  0.7710    0.7920     0.7615  0.7764
Linear SVC  2.0  bert        0.7839    0.7884     0.7828  0.7856

Random Forest

classifier     n_estimators  features  accuracy  precision  recall  f1
Random Forest  50            svd_2500  0.7056    0.6891     0.7145  0.7016
Random Forest  50            svd_5000  0.7087    0.6867     0.7202  0.7030
Random Forest  100           svd_2500  0.7160    0.7071     0.7216  0.7143
Random Forest  100           svd_5000  0.7191    0.7026     0.7284  0.7152

XGBoost

classifier  n_estimators  features  accuracy  precision  recall  f1
XGBoost     50            svd_2500  0.7242    0.7430     0.7177  0.7301
XGBoost     50            svd_5000  0.7282    0.7383     0.7253  0.7318
XGBoost     50            bert      0.7563    0.7579     0.7570  0.7575
XGBoost     100           svd_2500  0.7260    0.7435     0.7200  0.7315
XGBoost     100           svd_5000  0.7314    0.7432     0.7277  0.7354
XGBoost     100           bert      0.7621    0.7621     0.7636  0.7629

MLP

classifier  model_config                          features  accuracy  precision  recall  f1
MLP         Dropout + Scheduler + Early Stopping  bert      0.7909    0.7738     0.8026  0.7879
MLP         Dropout + Scheduler + Early Stopping  svd       0.7437    0.7524     0.7411  0.7467

Best Models

classifier           model_config                          features  accuracy  precision  recall  f1
Logistic Regression  C=0.5                                 bert      0.7834    0.7866     0.7831  0.7848
Linear SVC           C=2.0                                 bert      0.7839    0.7884     0.7828  0.7856
Random Forest        n_estimators=100                      svd_5000  0.7191    0.7026     0.7284  0.7152
XGBoost              n_estimators=100                      bert      0.7621    0.7621     0.7636  0.7629
MLP                  Dropout + Scheduler + Early Stopping  bert      0.7909    0.7738     0.8026  0.7879

7. Key Insights

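  1. BERT Supremacy: BERT embeddings outperformed TF-IDF-based features in every classifier where both were tested (Logistic Regression, Linear SVC, XGBoost, MLP); Random Forest was not evaluated on BERT features. The accuracy margin was small for the linear models (about 0.5 to 1.5 points) and larger for XGBoost and the MLP (about 3 to 5 points, measured against SVD-reduced TF-IDF). This suggests that capturing bidirectional context and semantic meaning works better than keyword frequency for sentiment analysis, especially in short, informal texts like tweets.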

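  2. Effectiveness of Linear Models: Linear models (Logistic Regression and Linear SVC) were very competitive on TF-IDF features, reaching up to 0.7782 accuracy, within about one point of their own BERT results.
    • Insight: High-dimensional sparse text features are often well separated by a linear decision boundary.
    • Regularization: The effect of $C$ was small (under 0.5 accuracy points) and not uniform. For Linear SVC on TF-IDF, the lowest value ($C = 0.5$, the strongest regularization) gave the best accuracy, up to 0.4 points above $C = 2.0$. Logistic Regression on tfidf_5000 instead improved slightly as $C$ increased, and results on BERT features were essentially unchanged. Stronger regularization therefore helps only in some settings on this noisy social media data.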
  3. Tree-based Model Limitations: Doubling the number of estimators from 50 to 100 gave only marginal accuracy gains: about 1.0 point for Random Forest and 0.2 to 0.6 points for XGBoost. Given the added training cost, scaling up tree models offers a lower return on this task than switching to a better embedding: BERT features improved XGBoost accuracy by roughly 3 points over SVD-reduced TF-IDF.

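  4. MLP Performance: The MLP trained on BERT embeddings achieved the highest accuracy (0.7909), recall (0.8026), and F1 (0.7879) of all models, although its precision (0.7738) was lower than that of the linear models. The gain over the best linear model is about 0.7 accuracy points. A plausible explanation is that the dense semantic vectors from BERT, combined with the non-linear capacity of the MLP, let the model capture sentiment cues that linear models miss.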
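The estimator-scaling comparison in insight 3 can be checked directly from the accuracy columns of the Random Forest and XGBoost tables. The short script below copies those values and reports the change in percentage points; the dictionary layout and the helper name `gain_in_points` are illustrative choices.

```python
# Accuracy values copied from the Random Forest and XGBoost result tables.
accuracy = {
    ("Random Forest", "svd_2500"): {50: 0.7056, 100: 0.7160},
    ("Random Forest", "svd_5000"): {50: 0.7087, 100: 0.7191},
    ("XGBoost", "svd_2500"): {50: 0.7242, 100: 0.7260},
    ("XGBoost", "svd_5000"): {50: 0.7282, 100: 0.7314},
    ("XGBoost", "bert"): {50: 0.7563, 100: 0.7621},
}


def gain_in_points(scores, low=50, high=100):
    """Accuracy change in percentage points when going from `low` to `high` estimators."""
    return round((scores[high] - scores[low]) * 100, 2)


gains = {key: gain_in_points(scores) for key, scores in accuracy.items()}
for (model, features), gain in gains.items():
    print(f"{model:<14} {features:<9} {gain:+.2f} points")
```

Random Forest gains about 1.04 points in both configurations, while XGBoost gains between 0.18 and 0.58 points.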