Dataset: DBpedia Ontology
The DBpedia Ontology Classification Dataset, known as dbpedia_14, contains a large collection of text samples classified into 14 distinct, non-overlapping classes.
Each text sample includes a title, which succinctly summarizes the main topic of the sample, and content, which covers the relevant information about that topic.
https://www.unitxt.ai/en/main/catalog/catalog.cards.dbpedia_14.html
{
0: "Company",
1: "Educational Institution",
2: "Artist",
3: "Athlete",
4: "Office Holder",
5: "Mean Of Transportation",
6: "Building",
7: "Natural Place",
8: "Village",
9: "Animal",
10: "Plant",
11: "Album",
12: "Film",
13: "Written Work",
}
We begin by inspecting the dataset structure and basic statistics:
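As a sketch of this inspection step, the snippet below uses a couple of hand-written records that mimic the dbpedia_14 schema (`title`, `content`, `label`); with the real data these rows would come from `datasets.load_dataset("dbpedia_14")`, and the sample contents here are invented:

```python
# Label names for the 14 dbpedia_14 classes (see the mapping above).
label_names = [
    "Company", "Educational Institution", "Artist", "Athlete",
    "Office Holder", "Mean Of Transportation", "Building", "Natural Place",
    "Village", "Animal", "Plant", "Album", "Film", "Written Work",
]

# Toy records standing in for real dataset rows (contents are invented).
samples = [
    {"title": "Acme Corp", "content": "Acme Corp is a company ...", "label": 0},
    {"title": "Springfield High", "content": "A public school ...", "label": 1},
]

print("num classes:", len(label_names))
print("fields:", sorted(samples[0].keys()))
for s in samples:
    print(s["title"], "->", label_names[s["label"]])
```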

To understand the class distribution, we visualize the counts of each label, ensuring there is a balanced representation across all 14 categories:
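The counting behind that plot can be sketched with a `Counter` over the label column (label ids here are stand-ins, not real dataset values):

```python
from collections import Counter

# Sketch: count examples per label id; with the real dataset these would be
# the values of the "label" column.
labels = [0, 1, 0, 2, 1, 0]  # stand-in label ids
dist = Counter(labels)
for label_id, count in sorted(dist.items()):
    print(label_id, "#" * count)  # crude text bar chart per class
```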

Next, we evaluate the length of the text data by analyzing the distribution of word counts in both the titles and content sections:
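A minimal version of that length analysis, using whitespace-split word counts on invented examples:

```python
# Sketch: word-count distributions for titles vs. content via str.split.
titles = ["Acme Corp", "Springfield High School"]
contents = [
    "Acme Corp is a multinational company.",
    "Springfield High is a public school in Ohio.",
]

title_lens = [len(t.split()) for t in titles]
content_lens = [len(c.split()) for c in contents]
print("title words:", title_lens, "mean:", sum(title_lens) / len(title_lens))
print("content words:", content_lens, "mean:", sum(content_lens) / len(content_lens))
```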

We also extract and visualize the most frequent words found in the content to gain insight into the common vocabulary used throughout the dataset:
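The word-frequency extraction can be sketched as a `Counter` over lowercased content words, filtered by a small illustrative stopword list (the texts are invented):

```python
from collections import Counter

# Sketch: most frequent content words after lowercasing, stripping
# punctuation, and dropping a small (illustrative) stopword list.
stopwords = {"is", "a", "the", "in", "of"}
contents = ["Acme Corp is a company.", "The company is based in Ohio."]

words = []
for text in contents:
    for w in text.lower().split():
        w = w.strip(".,")
        if w and w not in stopwords:
            words.append(w)

print(Counter(words).most_common(3))
```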

For faster experimentation and training efficiency in this assignment, a subset of the data is randomly sampled (50,000 samples for training, 10,000 for testing).
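The subsampling step can be sketched with `random.sample`; the sizes below are scaled down so the example runs standalone (50,000/10,000 in the assignment):

```python
import random

# Sketch of the subsampling step: draw train/test indices without
# replacement and check that the splits do not overlap.
random.seed(0)
population = list(range(1000))            # stand-in for dataset indices
train_idx = random.sample(population, 50)  # 50,000 in the assignment
rest = [i for i in population if i not in set(train_idx)]
test_idx = random.sample(rest, 10)         # 10,000 in the assignment

assert not set(train_idx) & set(test_idx)  # no leakage between splits
print(len(train_idx), len(test_idx))
```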
We use the distilbert-base-uncased tokenizer from Hugging Face (DistilBertTokenizerFast) to tokenize the text sequence, ensuring a maximum sequence length of 128 tokens. Special tokens are added, texts are padded to max_length, and truncated appropriately to maintain fixed-size input tensors.
We compare two different neural network architectures for text classification:
A recurrent neural network leveraging the context of word sequences.
A Transformer-based architecture known for powerful natural language understanding.
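The first architecture can be sketched as a bidirectional LSTM whose final hidden states from both directions feed a linear classifier; the layer sizes below are illustrative assumptions, not the assignment's exact configuration:

```python
import torch
import torch.nn as nn

class BiLSTMClassifier(nn.Module):
    """Minimal bidirectional LSTM text classifier (illustrative sizes)."""
    def __init__(self, vocab_size=30522, embed_dim=128, hidden=256, n_classes=14):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hidden, n_classes)

    def forward(self, input_ids):
        x = self.embed(input_ids)             # (batch, seq_len, embed_dim)
        _, (h, _) = self.lstm(x)              # h: (2, batch, hidden)
        h = torch.cat([h[0], h[1]], dim=-1)   # concat both directions
        return self.fc(h)                     # (batch, 14) logits

logits = BiLSTMClassifier()(torch.zeros(4, 128, dtype=torch.long))
print(logits.shape)
```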
The Transformer-based model fine-tunes the pre-trained distilbert-base-uncased model with a classification head on the [CLS] token position predicting 14 classes, with Dropout (0.3). Both models are trained with cross-entropy loss (nn.CrossEntropyLoss). The LSTM uses a learning rate of 1e-3, while DistilBERT uses 2e-5 (to not override the pre-trained weights aggressively).

The training and validation curves demonstrating Loss and Accuracy over epochs for both the LSTM and DistilBERT models are shown below:
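The classification head and loss can be sketched as follows; a random tensor stands in for the DistilBERT encoder output (hidden size 768) so the example runs offline, and the labels are invented:

```python
import torch
import torch.nn as nn

# Sketch of the head: Dropout(0.3) + Linear over the hidden state at the
# [CLS] position. A random tensor stands in for the encoder output.
hidden = torch.randn(4, 128, 768)        # (batch, seq_len, hidden_size)
head = nn.Sequential(nn.Dropout(0.3), nn.Linear(768, 14))

cls_state = hidden[:, 0]                 # [CLS] is the first token position
logits = head(cls_state)                 # (batch, 14) class logits
loss = nn.CrossEntropyLoss()(logits, torch.tensor([0, 1, 2, 3]))
print(logits.shape, loss.item())
```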

Finally, we evaluate the performance of both models after training. The comparison table below highlights training times, final losses, and evaluation metrics, demonstrating the distinct trade-offs between an efficient bidirectional RNN architecture and a more parameter-heavy but robust transformer model like DistilBERT.
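For concreteness, accuracy and macro-F1 (typical entries in such a comparison table) can be computed by hand as below; the predictions here are invented, not the models' actual outputs:

```python
# Sketch: accuracy and macro-F1 from scratch on invented predictions.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

f1s = []
for c in set(y_true):
    tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
    fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
    fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)

macro_f1 = sum(f1s) / len(f1s)
print(round(accuracy, 3), round(macro_f1, 3))
```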

Based on the data exploration and the comparative training results of the two models, we can draw the following key insights:
Dropout (p=0.3) and patience-based Early Stopping were crucial in ensuring both models could gracefully halt training and generalize to the unseen test set, rather than simply memorizing the training data across the 20 epochs.
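A patience-based early-stopping rule like the one used here can be sketched in a few lines (the patience value and validation losses below are illustrative):

```python
class EarlyStopping:
    """Patience-based early stopping on validation loss (simple sketch)."""
    def __init__(self, patience=3):
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch's validation loss; return True to stop training."""
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

stopper = EarlyStopping(patience=2)
losses = [0.9, 0.7, 0.8, 0.75, 0.76]  # validation loss stops improving
for epoch, loss in enumerate(losses):
    if stopper.step(loss):
        print("stopping at epoch", epoch)
        break
```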