Teaching a machine to listen
I’m building the sample library manager I’ve always wanted. One of its core features is detecting the musical key of audio files so you can search your entire sample library by key.
The classic approach, and why it falls short
Detecting the musical key of an audio file has been studied for decades and still isn’t solved cleanly. The classic approach, and the one I implemented first, is an algorithm called Krumhansl-Schmuckler.
The idea: take an audio file, run it through a Fast Fourier Transform to extract which frequencies are present, then map those frequencies to the twelve pitch classes (C, C♯, D, and so on). This gives you a chromagram, a fingerprint of which notes are most prominent in the audio. Then you compare that fingerprint against known profiles for all 24 major and minor keys. Best match wins.
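As a concrete sketch of that matching step, here's roughly what it looks like in numpy. The profile values are Krumhansl's published major and minor key profiles; extracting the chroma vector from the audio (the FFT and pitch-class mapping above) is assumed to have already happened, and the function name is illustrative, not from my actual implementation.

```python
import numpy as np

# Krumhansl-Kessler key profiles: perceived stability of each pitch class
# relative to the tonic, from Krumhansl's probe-tone experiments.
MAJOR = np.array([6.35, 2.23, 3.48, 2.33, 4.38, 4.09,
                  2.52, 5.19, 2.39, 3.66, 2.29, 2.88])
MINOR = np.array([6.33, 2.68, 3.52, 5.38, 2.60, 3.53,
                  2.54, 4.75, 3.98, 2.69, 3.34, 3.17])
NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def estimate_key(chroma):
    """chroma: 12 summed pitch-class energies (C..B). Returns the best-matching key."""
    best, best_r = None, -np.inf
    for mode, profile in (("major", MAJOR), ("minor", MINOR)):
        for tonic in range(12):
            # Rotate the profile so its tonic lines up with this candidate key,
            # then score the match with Pearson correlation.
            r = np.corrcoef(chroma, np.roll(profile, tonic))[0, 1]
            if r > best_r:
                best, best_r = f"{NAMES[tonic]} {mode}", r
    return best

# A chroma vector shaped exactly like the G-major profile should match G major.
print(estimate_key(np.roll(MAJOR, 7)))
```

Twenty-four correlations, highest one wins. That's the whole algorithm.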
It works. The problem is that relative major and minor keys share almost all the same notes. C major and A minor use the exact same pitches. The difference is which one feels like “home.” A statistical profile of pitch energy often can’t tell the difference, because the distinction between major and minor lives in context, emphasis, and feel. Exactly the sort of thing that’s difficult to capture in a formula but easy to learn from examples. Which is what pushed me toward letting a model figure it out on its own.
Training a neural network to hear key
Rather than following hand-crafted rules about pitch class distributions, I trained a convolutional neural network to learn its own patterns from real audio. Feed in spectrograms, output one of 24 key labels.
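To make "spectrograms in, 24 labels out" concrete, here's a deliberately minimal PyTorch sketch of that general shape. The layer sizes, the name `KeyNet`, and the input dimensions are illustrative assumptions, not the architecture I actually trained.

```python
import torch
import torch.nn as nn

class KeyNet(nn.Module):
    """Minimal CNN: spectrogram image in, 24 key logits out (12 tonics x 2 modes)."""
    def __init__(self, n_classes=24):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                     # halve time and frequency
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d(1),             # collapse to one value per channel
        )
        self.classifier = nn.Linear(32, n_classes)

    def forward(self, x):                        # x: (batch, 1, freq_bins, frames)
        return self.classifier(self.features(x).flatten(1))

# A batch of two fake spectrograms, e.g. 128 frequency bins by 256 time frames.
logits = KeyNet()(torch.randn(2, 1, 128, 256))
```

The point is the framing, not the layers: the network never sees note names or profiles, just images and labels.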
I used the GiantSteps Key dataset: 604 electronic music tracks with human-verified key annotations. It’s a solid, well-regarded dataset in the music information retrieval community. But 604 tracks is not a lot of training data, and the distribution makes it worse.
There are 24 possible keys (12 pitches, each major or minor) but real music isn’t evenly distributed across them. The training set is heavily skewed. The model gets plenty of examples for common keys and learns them well. But with rare keys, it might see ten examples total, which isn’t enough to learn much of anything.
In machine learning terms, this is called a class imbalance problem.
I spent a while reading about standard approaches to class imbalance (oversampling, synthetic data generation, loss weighting) before realizing I was overcomplicating it. Music has a property most datasets don’t: transposition preserves structure. A track in C major, shifted up 5 semitones, becomes a track in F major. Same patterns, different label. Shift every track into all 12 keys and your 604 tracks become 7,248, with every tonic equally represented (the major/minor split stays whatever it was, but no key is rare anymore). It’s the kind of solution that feels obvious the moment you see it.
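The augmentation itself is simple enough to sketch. On chroma-style features a transposition is literally a circular shift; on raw audio or spectrograms you'd need real pitch shifting (resampling), so treat this as the label arithmetic plus the easy case. Names here are illustrative.

```python
import numpy as np

NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def transpose_example(chroma, tonic, mode, semitones):
    """Shift a 12-bin chroma feature and its key label up by `semitones`.
    Transposition preserves the mode; only the tonic moves."""
    return np.roll(chroma, semitones), ((tonic + semitones) % 12, mode)

# A C-major track (C, E, G prominent) shifted up 5 semitones
# becomes an F-major track (F, A, C prominent). Same shape, new label.
chroma = np.zeros(12)
chroma[[0, 4, 7]] = 1.0
shifted, (tonic, mode) = transpose_example(chroma, 0, "major", 5)
print(NAMES[tonic], mode)
```

Apply all twelve shifts to every track and each tonic ends up with exactly one copy of every original track in its mode.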
Sound as image
With the data problem solved, training itself is the part that fascinates me most. A CNN applied to audio spectrograms is essentially treating sound as an image. The x-axis is time, the y-axis is frequency, and the brightness of each pixel is the energy at that frequency at that moment. A kick drum shows up as a vertical flash. A sustained note is a horizontal streak. A chord is a cluster of these streaks stacked together.
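A quick way to see the "sound as image" framing is to build a spectrogram by hand with numpy and check that a sustained note really is a horizontal streak, meaning the same frequency bin dominates every time frame:

```python
import numpy as np

sr = 8000
t = np.arange(sr)                                  # one second of samples
signal = np.sin(2 * np.pi * 440 * t / sr)          # a sustained A4

# Magnitude spectrogram: slice the signal into windowed frames, FFT each one.
frame, hop = 512, 256
frames = [signal[i:i + frame] * np.hanning(frame)
          for i in range(0, len(signal) - frame, hop)]
spec = np.abs(np.fft.rfft(frames, axis=1)).T       # rows: frequency, cols: time

peak_bins = spec.argmax(axis=0)                    # loudest bin in each frame
# A sustained note is a horizontal streak: every frame peaks in the same bin.
print(np.all(peak_bins == peak_bins[0]))
```

Swap the sine for a click and the streak turns into a vertical flash: energy in many bins, for one frame.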
This is richer information than the chromagram approach I started with. A chromagram tells you which notes are present. A spectrogram tells you which notes are present, when they appear, and how they move. That “when” turns out to matter a lot. Key isn’t just about which pitches show up. It’s about which ones get emphasis, which ones resolve, which ones feel like home. The chromagram can’t see that. The spectrogram can, or at least it gives the model enough to work with.
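Here's a toy illustration of exactly what that time-summed pitch profile throws away. Take any pitch-class-by-time matrix and its time reversal: completely different music, identical summed profile.

```python
import numpy as np

rng = np.random.default_rng(0)
forward = rng.random((12, 100))        # toy pitch-class-by-time matrix
backward = forward[:, ::-1]            # the same notes, played in reverse order

# Collapse the time axis to get a global pitch-class profile.
profile_fwd = forward.sum(axis=1)
profile_bwd = backward.sum(axis=1)

# The profiles are identical: all ordering information is gone.
print(np.allclose(profile_fwd, profile_bwd))
```

A resolution and its undoing sum to the same numbers. The spectrogram keeps the order; the summed profile can't.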
Musicians spend years developing an ear for this. They learn intervals, practice identifying chords, internalize how tension and resolution feel. This model just stared at a bunch of frequency images and started figuring it out. It has no concept of tension. It’s finding statistical patterns in pixel grids that happen to line up with something humans experience as emotion.
What I didn’t expect
I went into this expecting the technical challenge to be the interesting part. And it is. The FFT math, the chromagram extraction, the augmentation strategy, the model architecture. All of it is genuinely fascinating if you’re the kind of person who likes pulling things apart to see how they work.
But what I didn’t expect was how compelling it would be to simply watch a model learn. There’s something almost hypnotic about watching accuracy climb from 9% to 71% over the course of an hour, knowing that inside that black box, some approximation of musical understanding is taking shape. Not understanding the way a musician understands it. Something stranger, and in its own way, more interesting.
I don’t know what the final accuracy will be when training finishes. But I’ll be watching the logs the whole time.