SEML31

Assignment 1: Tabular Data

Group TN01 - Team SEML31

Colab Notebook: Open In Colab

Cover

The problem

Exploratory Data Analysis (EDA)

Preprocess

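The preprocessing pipeline has the following configuration options: the strategy for filling missing values (fill_na), the encoding of categorical features (cat_encode), and the fraction of variance retained by PCA (pca_variance).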

The preprocessing pipeline also splits the Date column into day, month, and year; drops the rows where the label RainTomorrow is NaN; and splits the dataset into train and test sets.
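The fixed steps described above can be sketched with pandas as follows. This is a minimal illustration, not the team's actual code: the function name `basic_preprocess`, the `test_size` and `seed` defaults, the decision to drop the original Date column, and the toy column `MinTemp` are all assumptions made for the example.

```python
import numpy as np
import pandas as pd


def basic_preprocess(df, label="RainTomorrow", test_size=0.2, seed=0):
    """Split Date into parts, drop unlabeled rows, return (train, test)."""
    df = df.copy()
    # Split the Date column into numeric day / month / year features.
    dates = pd.to_datetime(df["Date"])
    df["day"] = dates.dt.day
    df["month"] = dates.dt.month
    df["year"] = dates.dt.year
    df = df.drop(columns=["Date"])  # assumption: the raw column is not kept
    # Drop rows whose label is NaN.
    df = df.dropna(subset=[label])
    # Shuffle once, then hold out the first `test_size` fraction as the test set.
    df = df.sample(frac=1.0, random_state=seed).reset_index(drop=True)
    n_test = int(round(len(df) * test_size))
    return df.iloc[n_test:], df.iloc[:n_test]


# Hypothetical toy frame; only Date and RainTomorrow are named in the text.
toy = pd.DataFrame({
    "Date": ["2010-01-31", "2010-02-01", "2010-02-02",
             "2010-02-03", "2010-02-04", "2010-02-05"],
    "MinTemp": [13.4, 7.4, np.nan, 9.2, 17.5, 14.6],
    "RainTomorrow": ["No", "Yes", "No", np.nan, "No", "Yes"],
})
train, test = basic_preprocess(toy)
print(len(train), len(test))
```

One row of the toy frame has no label, so five rows remain and are divided between the two sets.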

Training result

The following preprocessing options were used for training, giving 2 × 2 × 2 = 8 preprocessing combinations:

  1. fill_na: median, mean
  2. cat_encode: ordinal, onehot
  3. pca_variance: 0.9, None
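One way the three options above could map onto scikit-learn components is sketched below. It is an illustration under assumptions rather than the project's pipeline: `make_preprocessor` and the toy column names are hypothetical, categorical gaps are filled with the most frequent value (the text does not say how they are handled), and no feature scaling is applied before PCA because none is mentioned.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder


def make_preprocessor(num_cols, cat_cols, fill_na="median",
                      cat_encode="ordinal", pca_variance=None):
    """Build a preprocessing pipeline for one option combination."""
    if cat_encode == "ordinal":
        encoder = OrdinalEncoder(handle_unknown="use_encoded_value",
                                 unknown_value=-1)
    else:
        encoder = OneHotEncoder(handle_unknown="ignore")
    categorical = Pipeline([
        # Assumption: categorical gaps are filled with the mode.
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", encoder),
    ])
    columns = ColumnTransformer(
        [
            ("num", SimpleImputer(strategy=fill_na), num_cols),
            ("cat", categorical, cat_cols),
        ],
        sparse_threshold=0.0,  # always return a dense array so PCA can use it
    )
    steps = [("columns", columns)]
    if pca_variance is not None:
        # A float in (0, 1) keeps enough components to explain that variance.
        steps.append(("pca", PCA(n_components=pca_variance, svd_solver="full")))
    return Pipeline(steps)


# Hypothetical toy data with one missing numeric and one missing category.
toy_features = pd.DataFrame({
    "MinTemp": [13.4, 7.4, np.nan, 9.2, 17.5, 14.6],
    "Rainfall": [0.6, 0.0, 0.0, 1.0, 0.2, 0.0],
    "WindGustDir": ["W", "WNW", "W", np.nan, "NE", "W"],
})
num_cols, cat_cols = ["MinTemp", "Rainfall"], ["WindGustDir"]
dense = make_preprocessor(num_cols, cat_cols, "mean", "onehot").fit_transform(toy_features)
reduced = make_preprocessor(num_cols, cat_cols, "median", "ordinal", 0.9).fit_transform(toy_features)
print(dense.shape, reduced.shape)
```

Passing `pca_variance=None` simply skips the PCA step, which matches the "None" option in the list.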

We trained 5 machine learning models using scikit-learn and PyTorch, with the following configurations:

  1. K-Nearest Neighbors, n_neighbors: 5, 11
  2. Logistic Regression, penalty: l2, None
  3. Random Forests, n_estimators: 50, 100
  4. XGBoost, learning_rate: 0.1, 0.3
  5. MLP (PyTorch), num_hidden: 2, 3
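The two option lists above form a grid that can be enumerated as a Cartesian product. The sketch below only counts the runs; the dictionary layout and variable names are assumptions for illustration, and no model is actually trained.

```python
from itertools import product

# Preprocessing options, as listed in the text.
preprocess_grid = {
    "fill_na": ["median", "mean"],
    "cat_encode": ["ordinal", "onehot"],
    "pca_variance": [0.9, None],
}

# One varied hyperparameter per model family, as listed in the text.
model_grid = {
    "K-Nearest Neighbors": ("n_neighbors", [5, 11]),
    "Logistic Regression": ("penalty", ["l2", None]),
    "Random Forests": ("n_estimators", [50, 100]),
    "XGBoost": ("learning_rate", [0.1, 0.3]),
    "MLP": ("num_hidden", [2, 3]),
}

# Every combination of preprocessing options.
preprocess_configs = [
    dict(zip(preprocess_grid, values))
    for values in product(*preprocess_grid.values())
]

# Every (model family, hyperparameter value) pair.
model_configs = [
    (name, {param: value})
    for name, (param, values) in model_grid.items()
    for value in values
]

# Each model configuration is paired with each preprocessing configuration.
runs = list(product(model_configs, preprocess_configs))
print(len(preprocess_configs), len(model_configs), len(runs))  # 8 10 80
```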

With 10 model configurations, each trained on 8 preprocessing configurations, we get 80 models in total. Here we show only the results table for XGBoost, where model_config is the learning_rate (click here to see the rest). The highest f1 score, 0.833534, comes from learning_rate 0.3 with mean imputation, ordinal encoding, and no PCA:

model_config  fill_na  encode   pca   accuracy  precision  recall    f1
0.1           median   ordinal  0.9   0.818974  0.818315   0.78145   0.799458
0.1           median   ordinal  None  0.840999  0.851672   0.79392   0.821783
0.1           median   onehot   0.9   0.819261  0.816677   0.784723  0.800382
0.1           median   onehot   None  0.839919  0.851417   0.791426  0.820326
0.1           mean     ordinal  0.9   0.819333  0.818256   0.782541  0.8
0.1           mean     ordinal  None  0.846757  0.863097   0.794076  0.827149
0.1           mean     onehot   0.9   0.819405  0.817045   0.784567  0.800477
0.1           mean     onehot   None  0.845966  0.864574   0.790179  0.825705
0.3           median   ordinal  0.9   0.829986  0.829138   0.795791  0.812122
0.3           median   ordinal  None  0.846469  0.851907   0.80795   0.829346
0.3           median   onehot   0.9   0.826315  0.825049   0.791738  0.80805
0.3           median   onehot   None  0.849061  0.853934   0.812003  0.832441
0.3           mean     ordinal  0.9   0.831138  0.829154   0.798909  0.81375
0.3           mean     ordinal  None  0.850932  0.860438   0.808262  0.833534
0.3           mean     onehot   0.9   0.830922  0.830569   0.796259  0.813052
0.3           mean     onehot   None  0.850788  0.861352   0.806703  0.833132
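The f1 column in the table above appears consistent with the harmonic mean of the reported precision and recall, F1 = 2PR / (P + R). The short check below recomputes it for three rows copied from the table; the helper name `f1_from` is hypothetical, and how precision and recall were averaged across classes is not stated in the text.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported f1) copied from three rows of the XGBoost table.
rows = [
    (0.818315, 0.78145, 0.799458),   # 0.1, median, ordinal, PCA 0.9
    (0.863097, 0.794076, 0.827149),  # 0.1, mean, ordinal, no PCA
    (0.860438, 0.808262, 0.833534),  # 0.3, mean, ordinal, no PCA
]

# Largest gap between the recomputed and the reported f1 values.
max_abs_error = max(abs(f1_from(p, r) - f1) for p, r, f1 in rows)
print(max_abs_error < 1e-5)
```

Differences at this scale are what rounding the table entries to six decimals would produce.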

The table below compares the highest-f1 configuration of each model type:

classifier           model_config  fill_na  encode   pca   accuracy  precision  recall    f1
K-Nearest Neighbors  11            mean     onehot   None  0.814223  0.813645   0.775214  0.793965
Logistic Regression  None          median   onehot   None  0.777442  0.769243   0.739984  0.754330
Random Forest        100           mean     ordinal  None  0.836536  0.845103   0.790959  0.817135
XGBoost              0.3           mean     ordinal  None  0.850932  0.860438   0.808262  0.833534
MLP                  2             median   onehot   None  0.815950  0.781770   0.834295  0.807179
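To read the comparison above at a glance, the rows can be ranked by f1 and the leader of each metric extracted. The values below are copied from the table; the variable names are only illustrative.

```python
# (classifier, accuracy, precision, recall, f1) copied from the comparison table.
best_per_model = [
    ("K-Nearest Neighbors", 0.814223, 0.813645, 0.775214, 0.793965),
    ("Logistic Regression", 0.777442, 0.769243, 0.739984, 0.754330),
    ("Random Forest", 0.836536, 0.845103, 0.790959, 0.817135),
    ("XGBoost", 0.850932, 0.860438, 0.808262, 0.833534),
    ("MLP", 0.815950, 0.781770, 0.834295, 0.807179),
]
metrics = ["accuracy", "precision", "recall", "f1"]

# Classifier with the highest value for each metric (metric i sits at index i + 1).
leaders = {
    metric: max(best_per_model, key=lambda row: row[i + 1])[0]
    for i, metric in enumerate(metrics)
}

# Classifiers ordered from highest to lowest f1.
ranking = [row[0] for row in sorted(best_per_model, key=lambda row: row[4], reverse=True)]

print(leaders)
# {'accuracy': 'XGBoost', 'precision': 'XGBoost', 'recall': 'MLP', 'f1': 'XGBoost'}
print(ranking)
# ['XGBoost', 'Random Forest', 'MLP', 'K-Nearest Neighbors', 'Logistic Regression']
```

XGBoost leads on accuracy, precision, and f1, while the MLP has the highest recall.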

Remark