Group TN01 - Team SEML31
Dataset: Reference Annotations: The Beatles
Colab Notebook: Chord_Recognition.ipynb
This project addresses the problem of chord sequence recognition in music audio using a Hidden Markov Model (HMM).
Applications:
We begin with a brief review of the basic music theory concepts.
In the Western music system, one octave contains 12 notes, with A4 = 440 Hz as the standard tuning reference.
The pitch of all other notes is calculated relative to this reference.
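As a small illustration of how other pitches follow from the A4 reference, the sketch below assumes 12-tone equal temperament and the standard MIDI numbering (A4 = note 69). The function names are illustrative and do not come from the project notebook.

```python
A4_HZ = 440.0   # tuning reference stated above
A4_MIDI = 69    # MIDI note number of A4 (standard MIDI convention)

NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]


def note_frequency(midi_note: int) -> float:
    """Frequency in Hz of a MIDI note under 12-tone equal temperament."""
    # Each semitone multiplies the frequency by 2 ** (1 / 12).
    return A4_HZ * 2.0 ** ((midi_note - A4_MIDI) / 12.0)


def note_name(midi_note: int) -> str:
    """Name such as 'C4' for a MIDI note number (MIDI 60 is C4)."""
    octave = midi_note // 12 - 1
    return f"{NOTE_NAMES[midi_note % 12]}{octave}"


if __name__ == "__main__":
    for n in (60, 69, 81):
        print(note_name(n), round(note_frequency(n), 2))
```

Moving up 12 semitones doubles the frequency, which is why the 12 pitch classes repeat in every octave.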

Among the various chord types, the two most common are major and minor.

Complex chords occur less frequently. Therefore, in this project, all uncommon chords are converted to either major or minor before training the model.
Each of the 12 notes can serve as the root of one major and one minor chord, giving 12 × 2 = 24 chord states.
An additional state N represents sections with no chord (silence), for 25 states in total.
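The reduction to 25 states could look roughly like the sketch below. It assumes the annotations use labels of the form `root:quality/bass` (for example `A:min7` or `Bb:maj7/3`), and the rule that qualities starting with `min`, `dim` or `hdim` become minor while everything else becomes major is an assumption for illustration, not the project's documented mapping.

```python
import re

PITCH_CLASSES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]
NATURAL_INDEX = {"C": 0, "D": 2, "E": 4, "F": 5, "G": 7, "A": 9, "B": 11}

# 24 chord states (12 roots x {maj, min}) plus the no-chord state "N".
STATES = [f"{root}:{q}" for root in PITCH_CLASSES for q in ("maj", "min")] + ["N"]


def simplify_chord(label: str) -> str:
    """Reduce a chord label to one of the 25 states.

    Assumed rule: qualities starting with "min", "dim" or "hdim" map to
    minor; every other quality maps to major. Unparseable labels map to "N".
    """
    if label == "N":
        return "N"
    match = re.match(r"^([A-G])([#b]*)(?::([^/]*))?", label)
    if match is None:
        return "N"
    letter, accidentals, quality = match.groups()
    # Sharps raise the root by a semitone, flats lower it.
    shift = accidentals.count("#") - accidentals.count("b")
    root = PITCH_CLASSES[(NATURAL_INDEX[letter] + shift) % 12]
    quality = quality or "maj"  # a bare root such as "C" is read as major
    is_minor = quality.startswith(("min", "dim", "hdim"))
    return f"{root}:{'min' if is_minor else 'maj'}"
```

Flats are folded onto their sharp equivalents (`Bb` becomes `A#`) so that every label lands on exactly one of the 12 roots.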
The dataset consists of 50 songs by The Beatles, each including an audio file and an annotation file (.lab) containing chord labels with timing, one entry per line:
start_time end_time chord_label
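A minimal sketch of reading such an annotation file is shown below. It assumes whitespace-separated fields with times in seconds; the `ChordSegment` type, the `parse_lab` name and the sample lines are made up for illustration.

```python
from typing import List, NamedTuple


class ChordSegment(NamedTuple):
    start: float  # seconds
    end: float    # seconds
    label: str


def parse_lab(text: str) -> List[ChordSegment]:
    """Parse 'start_time end_time chord_label' lines into segments."""
    segments = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 3:
            continue  # skip blank or malformed lines
        segments.append(ChordSegment(float(parts[0]), float(parts[1]), parts[2]))
    return segments


# Made-up sample content in the format described above.
EXAMPLE_LAB = """0.000 2.612 N
2.612 11.459 E
11.459 12.921 A
"""

if __name__ == "__main__":
    for segment in parse_lab(EXAMPLE_LAB):
        print(segment)
```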
Each song is randomly segmented into 8 sections to enable the model to learn intermediate musical segments.

Corresponding .lab files are generated for each audio segment.
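One possible reading of this step is sketched below: draw 7 random cut points to split a song into 8 contiguous sections, then clip the chord annotations to each section and shift the times so that each section starts at 0. The cutting strategy and both function names are assumptions; the notebook may segment differently.

```python
import random
from typing import List, Tuple

Annotation = Tuple[float, float, str]  # (start_time, end_time, chord_label)


def random_boundaries(duration: float, n_sections: int = 8, seed: int = 0) -> List[float]:
    """Return n_sections + 1 sorted boundaries covering [0, duration]."""
    rng = random.Random(seed)
    cuts = sorted(rng.uniform(0.0, duration) for _ in range(n_sections - 1))
    return [0.0] + cuts + [duration]


def clip_annotations(annotations: List[Annotation], seg_start: float,
                     seg_end: float) -> List[Annotation]:
    """Keep the part of each annotation inside the section, re-timed from 0."""
    clipped = []
    for start, end, label in annotations:
        lo, hi = max(start, seg_start), min(end, seg_end)
        if hi > lo:  # the annotation overlaps the section
            clipped.append((lo - seg_start, hi - seg_start, label))
    return clipped


if __name__ == "__main__":
    song = [(0.0, 10.0, "C"), (10.0, 20.0, "G"), (20.0, 30.0, "A:min")]
    bounds = random_boundaries(30.0)
    for seg_start, seg_end in zip(bounds[:-1], bounds[1:]):
        print(clip_annotations(song, seg_start, seg_end))
```

Chords that straddle a cut point are split, so each generated .lab file covers its audio segment exactly.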

Audio segment statistics:

Example: A_Day_In_The_Life.mp3

The feature extraction pipeline uses the Librosa library with a sample rate of 22050 Hz and a hop length of 512 samples.
Harmonic-percussive source separation (HPSS) splits the signal into harmonic and percussive components so that drum signals, which do not contribute to chord recognition, can be removed.

Chroma features measure the energy of the 12 pitch classes in each time frame, computed with a Constant-Q Transform using 48 bins per octave for higher frequency resolution.

Tonnetz (tonal centroid) features represent harmonic relationships between notes in tonal space, capturing tonal context beyond simple pitch content.
Spectral contrast measures the difference in amplitude between peaks and valleys in the spectrum across 6 frequency bands plus the mean, capturing timbral texture and harmonic richness.
A 1D median filter with a window size of 3 is applied along the time axis to all feature matrices (Chroma, Tonnetz, Spectral Contrast). This removes outliers and smooths temporal variations while preserving sharp transitions between chords.
Each feature vector is normalized to unit L2 norm (Euclidean length = 1). This ensures that features are scale-invariant across different audio volume levels, making the model robust to loudness variations in the input audio.
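The pipeline described above might be sketched as follows. The Librosa calls (`librosa.effects.hpss`, `librosa.feature.chroma_cqt`, `librosa.feature.tonnetz`, `librosa.feature.spectral_contrast`) exist, but several choices here are assumptions: computing every feature from the harmonic component, stacking the three matrices into one 25-dimensional vector per frame, normalizing the stacked vector rather than each feature type, and padding the edges of the median filter by repetition.

```python
import numpy as np

SR = 22050        # sample rate stated above
HOP_LENGTH = 512  # hop length stated above


def median_smooth(features: np.ndarray, size: int = 3) -> np.ndarray:
    """Median-filter each feature row along the time axis (axis 1)."""
    pad = size // 2
    padded = np.pad(features, ((0, 0), (pad, pad)), mode="edge")
    n_frames = features.shape[1]
    # Stack the shifted copies, then take the median across them.
    windows = np.stack([padded[:, i:i + n_frames] for i in range(size)])
    return np.median(windows, axis=0)


def l2_normalize(features: np.ndarray, eps: float = 1e-10) -> np.ndarray:
    """Scale each frame (column) to unit Euclidean length."""
    norms = np.linalg.norm(features, axis=0, keepdims=True)
    return features / np.maximum(norms, eps)


def extract_features(path: str) -> np.ndarray:
    """Return a (25, n_frames) matrix: 12 chroma + 6 tonnetz + 7 contrast rows."""
    import librosa  # imported here so the helpers above need only NumPy

    y, sr = librosa.load(path, sr=SR)
    y_harm, _ = librosa.effects.hpss(y)  # keep only the harmonic component
    chroma = librosa.feature.chroma_cqt(
        y=y_harm, sr=sr, hop_length=HOP_LENGTH, bins_per_octave=48)
    tonnetz = librosa.feature.tonnetz(chroma=chroma, sr=sr)
    contrast = librosa.feature.spectral_contrast(
        y=y_harm, sr=sr, hop_length=HOP_LENGTH)
    # Guard against off-by-one differences in frame counts.
    n = min(chroma.shape[1], tonnetz.shape[1], contrast.shape[1])
    blocks = [median_smooth(m[:, :n]) for m in (chroma, tonnetz, contrast)]
    return l2_normalize(np.vstack(blocks))
```

With a window of 3, a single-frame spike is replaced by its neighbours' value, while a step that lasts two or more frames passes through unchanged, which is the behaviour wanted at chord boundaries.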
In traditional HMMs, the emission matrix B is designed for discrete observations. However, the extracted audio features (Chroma, Tonnetz, Spectral Contrast) are continuous.
Therefore, we use a Gaussian Mixture Model (GMM) to model the probability distribution of the features for each chord, replacing the discrete emission matrix.
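To make the idea concrete, the sketch below evaluates a GMM emission log-likelihood for one frame and decodes the most likely chord sequence with the Viterbi algorithm in the log domain. Diagonal covariances, the parameter layout and the function names are assumptions for illustration; the notebook's training procedure and number of mixture components are not shown here.

```python
import numpy as np


def gmm_log_likelihood(x, weights, means, variances):
    """log p(x) under a diagonal-covariance GMM.

    x: (d,), weights: (m,), means and variances: (m, d).
    """
    diff = x - means
    # Log density of x under each component, summed over dimensions.
    log_norm = -0.5 * (np.log(2.0 * np.pi * variances) + diff ** 2 / variances).sum(axis=1)
    a = np.log(weights) + log_norm
    peak = a.max()
    return peak + np.log(np.exp(a - peak).sum())  # numerically stable log-sum-exp


def viterbi(log_pi, log_A, log_B):
    """Most likely state path.

    log_pi: (n_states,) initial log-probabilities.
    log_A:  (n_states, n_states), entry [i, j] is log P(next = j | current = i).
    log_B:  (n_frames, n_states) emission log-likelihoods per frame and state.
    """
    n_frames, n_states = log_B.shape
    delta = log_pi + log_B[0]
    backptr = np.zeros((n_frames, n_states), dtype=int)
    for t in range(1, n_frames):
        scores = delta[:, None] + log_A  # scores[i, j]: best path ending in i, then i -> j
        backptr[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_B[t]
    path = np.empty(n_frames, dtype=int)
    path[-1] = delta.argmax()
    for t in range(n_frames - 1, 0, -1):
        path[t - 1] = backptr[t, path[t]]
    return path
```

In this setup each of the 25 states would have its own GMM, `log_B[t, j]` would hold `gmm_log_likelihood` of frame t under the GMM of state j, and the returned indices would map back to chord labels.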


Overall Accuracy: 42.77%

Although the dataset was augmented, the overall accuracy remains moderate.
The chords most frequently recognized correctly are Cmaj, Dmaj, Amaj, Gmaj, which also correspond to the most common chords identified in the EDA.
Future work will employ wavelet-based feature extraction to better capture the characteristics of music.