deep_learning_assignment

Assignment 1: Multimodal classification

Dataset: UPMC Food-101

VISIIR

For this project, we created the UPMC Food-101 dataset. This dataset contains 101 food categories. For each category, we gathered around 800 to 950 images from a Google Image search of the category title. Because the images come from web search results, the dataset may contain some noise.

Exploratory Data Analysis (EDA)

The dataset consists of food images, title text, and their corresponding 101 categories. Let’s take a look at the data structure.

[Figures: DataFrame info; DataFrame head]

The images are organized into 101 distinct classes.

[Figures: 101-category list; folder structure]

Class Distribution

Visualizing the category distribution shows the class imbalance across the training set:

[Figure: category distribution]
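The per-class counts behind such a plot can be tallied with a small helper. This is a minimal sketch, assuming the training images are laid out one sub-folder per category (the folder layout and `.jpg` extension are assumptions, not confirmed by the report):

```python
from collections import Counter
from pathlib import Path

def class_distribution(train_dir: str) -> Counter:
    """Count images per class, assuming one sub-folder per category."""
    counts = Counter()
    for class_dir in Path(train_dir).iterdir():
        if class_dir.is_dir():
            # assumes images are stored as .jpg files
            counts[class_dir.name] = sum(1 for _ in class_dir.glob("*.jpg"))
    return counts
```

The resulting `Counter` can be fed directly to a bar plot to visualize the imbalance.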

Title Length Analysis

Analyzing the word counts of the text titles shows how title lengths are distributed across the dataset:

[Figure: title word-count distribution]
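The word-count tally itself is straightforward; a minimal sketch (splitting on whitespace, which is an assumption about the tokenization used):

```python
from collections import Counter

def word_count_distribution(titles):
    """Map each title to its whitespace-token count and tally the counts."""
    lengths = [len(t.split()) for t in titles]
    return Counter(lengths)  # {word_count: number_of_titles}
```

For example, `word_count_distribution(["baked apple pie", "sushi rolls"])` yields one 3-word and one 2-word title.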

Sample Images

Visualizing some sample images paired with their categories and text titles:

[Figure: sample images with their titles and categories]

Image Dimensions

Checking the image sizes across a sample:

[Figure: sample image dimensions]
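Image sizes can be read cheaply from file headers with Pillow, without decoding pixel data. A minimal sketch (the sampling of paths is left to the caller):

```python
from PIL import Image

def sample_dimensions(image_paths):
    """Return (width, height) for each image; PIL reads only the header."""
    sizes = []
    for path in image_paths:
        with Image.open(path) as im:
            sizes.append(im.size)  # (width, height)
    return sizes
```

Tallying these sizes reveals whether a fixed resize (e.g. to CLIP's 224x224 input) is needed during preprocessing.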

Model Approaches

In this dataset, each sample contains an image, its text title, and one of 101 class labels.

We test two different classification approaches using the pre-trained CLIP model (openai/clip-vit-base-patch32):

Approach 1: Zero-shot

We encode the 101 classes into text prompts styled as “A photo of {class}”. For each sample, we pass the image through CLIP’s image encoder and compute the cosine similarity between the resulting image embedding and the 101 text embeddings; the most similar prompt gives the predicted class. This relies entirely on CLIP’s pre-trained knowledge: it does not use the title text and requires no training.
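The matching step reduces to a cosine-similarity argmax. A minimal NumPy sketch, assuming the CLIP image and prompt embeddings have already been computed (the real pipeline would obtain them from `openai/clip-vit-base-patch32`):

```python
import numpy as np

def zero_shot_predict(image_emb, class_text_embs):
    """Return the index of the class whose prompt embedding is most
    cosine-similar to the image embedding.

    image_emb:       (D,) image feature from CLIP's image encoder
    class_text_embs: (C, D) text features for "A photo of {class}" prompts
    """
    img = image_emb / np.linalg.norm(image_emb)
    txt = class_text_embs / np.linalg.norm(class_text_embs, axis=1, keepdims=True)
    sims = txt @ img            # (C,) cosine similarities
    return int(np.argmax(sims))
```

Because both sides are L2-normalized, the dot product equals the cosine similarity.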

Results: [Figures: zero-shot confusion matrix; zero-shot top-10 errors]

Approach 2: Few-shot Multimodal MLP

For each sample, we pass the image into the CLIP image encoder and the sample’s title text into the CLIP text encoder to obtain both embeddings (512-dim each). The embeddings are concatenated into a 1024-dim vector and passed through a custom Multi-Layer Perceptron (MLP) with ReLU and Dropout layers to produce the 101-class output logits. This model is trained explicitly on the text, image, and class labels.
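The fusion head described above can be sketched in PyTorch as follows. The hidden width and dropout rate are assumptions; the report only specifies the 1024-dim concatenated input, the ReLU/Dropout layers, and the 101-class output:

```python
import torch
import torch.nn as nn

class MultimodalMLP(nn.Module):
    """Concatenate CLIP image and text embeddings (512-dim each) and
    classify with a small MLP. Hidden size and dropout are assumptions."""

    def __init__(self, embed_dim=512, hidden_dim=512, num_classes=101, p_drop=0.3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * embed_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(p_drop),
            nn.Linear(hidden_dim, num_classes),
        )

    def forward(self, image_emb, text_emb):
        fused = torch.cat([image_emb, text_emb], dim=-1)  # (B, 1024)
        return self.net(fused)                            # (B, 101) logits
```

The logits would then be trained with standard cross-entropy against the class labels, with the CLIP encoders kept frozen.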

Training progress: [Figure: training curves]

Results: [Figures: few-shot confusion matrix; few-shot top-10 errors]

Comprehensive Insights and Analysis

[Table: zero-shot vs. few-shot comparison]

We evaluated two core capabilities of the CLIP base model:

  1. Zero-shot: directly compared image embeddings to the text prompts “A photo of {class}” via cosine similarity, measuring CLIP’s out-of-the-box capability.
  2. Few-shot multimodal MLP: concatenated the image and title text embeddings, passed them into an MLP, and trained it. This approach leverages text features unique to each sample and fits the model to the specific UPMC dataset.

Analysis