Due to microphone permissions, this webpage should be opened on Chrome or Firefox browsers and not on mobile devices.
A holy grail for any songwriter is to understand how harmony functions in their favorite songs. Whether you are a guitarist strumming and singing, a pianist at a wedding, or a composer writing for full orchestra, it is harmony that organizes and structures the music we love. Future products that create compelling playlists, teach students, and help composers will need to dive deep into the rich panoply of music that has already been created, but we do not currently have reliable tools for conducting this analysis at the scale we need.
Current efforts, like the musicology project found on this website and HookTheory.com, rely on human input to accomplish the task. Because analysis is inherently subjective, their annotations are not consistent with one another, which makes generalizations difficult. But what if we could use the latest developments in deep learning to do it for us? This would open up a huge landscape for meaningful musical analysis. Imagine an auto-generated playlist that shows the development of a chord progression through the past 80 years, weekly updates on which progressions are charting today, or a compositional tool that could give you quick thumbnails of harmonically similar songs so you can identify exactly which tweaks will help your song stand out.
The possibilities are endless, and that is what we aim to open up using deep learning. Harmony classification is just the first of several data-reduction steps that can help machine learning models do a better job on music. We hope to help advance the current state of the art into new domains.
Unbaking a Cake: Why Is It Hard For Computers To Hear Music?
At first I was surprised to find out that computers aren’t good at identifying the notes in a sound. After all, don’t we have a thing called the Fourier transform that tells you all the frequencies in a sound? Indeed we do, but there is also a major problem!
A note played on a real-world instrument does produce a peak in the Fourier transform at its fundamental frequency, but it also produces peaks at integer multiples of that frequency, called harmonics. The problem is that the harmonics of different notes overlap, particularly for notes that sound together in chords. It becomes difficult to tell which notes are responsible for a given frequency response without kludgy methods, and many of those have been devised. As a result, this remains an unsolved problem: the existing attempts vary in effectiveness, but none work especially well, and researchers have not agreed on a simple or elegant method.
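To make the overlap concrete, here is a minimal Python sketch (not part of the project code). It assumes ideal harmonics at exact integer multiples of the fundamental, equal-tempered tuning with A4 = 440 Hz, and a hypothetical frequency resolution of 5 Hz; the function names are illustrative only.

```python
def midi_to_hz(midi_note):
    """Equal-tempered frequency of a MIDI note number (A4 = MIDI 69 = 440 Hz)."""
    return 440.0 * 2 ** ((midi_note - 69) / 12)


def overlapping_partials(f_a, f_b, n_harmonics=6, resolution_hz=5.0):
    """Return (k_a, k_b) pairs where harmonic k_a of note A and harmonic k_b
    of note B fall within resolution_hz of each other."""
    pairs = []
    for k_a in range(1, n_harmonics + 1):
        for k_b in range(1, n_harmonics + 1):
            if abs(f_a * k_a - f_b * k_b) < resolution_hz:
                pairs.append((k_a, k_b))
    return pairs


a3 = midi_to_hz(57)  # 220 Hz
e4 = midi_to_hz(64)  # about 329.63 Hz, a fifth above A3
print(overlapping_partials(a3, e4))
```

For A3 and E4, the third harmonic of A3 (660 Hz) lands less than 1 Hz from the second harmonic of E4 (about 659.3 Hz), so a single peak near 660 Hz cannot be attributed to either note on its own.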
But can we find an elegant solution using deep learning? We believe we can, and the solution demonstrates deep learning's enormous power to adapt to a problem given the right approach!
Approach: Artificially Create Lots of Chords
Deep learning trains on huge datasets to identify the common elements, and the interactions among them, that define the datapoints associated with a given label. In the case of harmony, the dataset would be a huge selection of different chords, produced through different methods, that are still all the same chord. To produce these chords, we first create samples of notes from a variety of instruments (aka banks) using Max and Ableton Live. Then our Python script assembles the samples for the notes that constitute a given chord, using varying weights and banks, and applies noise, pitch shifting, and inter-note pitch shifting.
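The assembly step might look roughly like the sketch below. This is not the project's script: it swaps the sampled instrument banks for sums of sine partials, stands in for inter-note pitch shifting with a small random detune per note, and uses hypothetical names and parameter values (`synth_chord`, `SAMPLE_RATE`, `detune_cents`, `noise_level`) throughout.

```python
import math
import random

SAMPLE_RATE = 16000  # Hz (assumed value)
CHORD_INTERVALS = {"maj": (0, 4, 7), "min": (0, 3, 7)}  # semitones above the root


def midi_to_hz(midi_note):
    """Equal-tempered frequency of a MIDI note number (A4 = 440 Hz)."""
    return 440.0 * 2 ** ((midi_note - 69) / 12)


def synth_chord(root_midi, quality, n_samples=2048, n_partials=4,
                detune_cents=10.0, noise_level=0.01, rng=None):
    """Return (signal, pitch_classes) for one randomized rendition of a chord."""
    rng = rng or random.Random()
    intervals = CHORD_INTERVALS[quality]
    signal = [0.0] * n_samples
    for interval in intervals:
        weight = rng.uniform(0.5, 1.0)                    # per-note loudness
        cents = rng.uniform(-detune_cents, detune_cents)  # per-note detune
        f0 = midi_to_hz(root_midi + interval) * 2 ** (cents / 1200)
        for k in range(1, n_partials + 1):                # fundamental + harmonics
            step = 2 * math.pi * f0 * k / SAMPLE_RATE
            for t in range(n_samples):
                signal[t] += (weight / k) * math.sin(step * t)
    # Additive Gaussian noise on every sample.
    signal = [s + rng.gauss(0.0, noise_level) for s in signal]
    pitch_classes = sorted({(root_midi + i) % 12 for i in intervals})
    return signal, pitch_classes


signal, label = synth_chord(60, "maj", rng=random.Random(0))  # one C major rendition
```

Calling `synth_chord` repeatedly with different seeds yields many different waveforms that all carry the same pitch-class label, which is the property the training set needs.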
We then apply the Fourier transform to these samples, select one frame, take its absolute values, and send that signal through a funnel-shaped, fully connected neural net, which is trained to predict which pitch classes (notes without their associated octaves) are present in the given sample. With a sufficiently comprehensive dataset, a sufficiently large neural net, and sufficiently long training, we should arrive at a net that can classify new and potentially real-world samples!
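The data flow can be sketched in plain Python as follows. Several things here are assumptions rather than the project's actual choices: a naive discrete Fourier transform stands in for an FFT library call, the hidden layers use ReLU and the 12 outputs use a sigmoid, the layer sizes are arbitrary, and the weights are random because training is not shown.

```python
import cmath
import math
import random


def magnitude_spectrum(frame):
    """Absolute values of the DFT of one frame (non-negative frequencies only)."""
    n = len(frame)
    return [abs(sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(frame)))
            for k in range(n // 2 + 1)]


def multi_hot(pitch_classes):
    """12-element target: 1 where a pitch class (0 = C ... 11 = B) is present."""
    return [1 if pc in pitch_classes else 0 for pc in range(12)]


def init_funnel(sizes, rng):
    """Random weights for a fully connected net with shrinking layer sizes."""
    layers = []
    for n_in, n_out in zip(sizes, sizes[1:]):
        weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)]
                   for _ in range(n_out)]
        layers.append((weights, [0.0] * n_out))
    return layers


def forward(x, layers):
    """ReLU on hidden layers, sigmoid on the output layer (one score per pitch class)."""
    for i, (weights, biases) in enumerate(layers):
        z = [sum(w * xi for w, xi in zip(row, x)) + b
             for row, b in zip(weights, biases)]
        if i == len(layers) - 1:
            x = [1.0 / (1.0 + math.exp(-v)) for v in z]
        else:
            x = [max(0.0, v) for v in z]
    return x


frame = [math.sin(2 * math.pi * 8 * t / 64) for t in range(64)]  # toy input frame
spectrum = magnitude_spectrum(frame)                             # 33 bins
net = init_funnel([33, 16, 12], random.Random(0))                # 33 -> 16 -> 12 funnel
scores = forward(spectrum, net)                                  # 12 values in (0, 1)
```

Each output is treated independently, so a chord is predicted as the set of pitch classes whose score crosses a threshold; that is what makes this a multi-label problem rather than a choice among named chords.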
How Well Does It Work?
The training procedure gives us both training and validation accuracies at the element level (how many pitch classes were correctly identified) and the row level (how many chords were identified with every pitch class correct). We find that if we sufficiently increase the percussion noise, or make the net too small, the model will “crap out”, plateauing at accuracies below 100%.
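The two metrics differ in how strictly they count an error. A small illustration, assuming predictions and targets are stored as rows of twelve 0/1 values (the function names are hypothetical):

```python
def element_accuracy(predicted, target):
    """Fraction of individual pitch-class entries that match."""
    total = sum(len(row) for row in target)
    correct = sum(p == t
                  for p_row, t_row in zip(predicted, target)
                  for p, t in zip(p_row, t_row))
    return correct / total


def row_accuracy(predicted, target):
    """Fraction of chords whose entire 12-element row matches."""
    matches = sum(list(p_row) == list(t_row)
                  for p_row, t_row in zip(predicted, target))
    return matches / len(target)


c_major = [1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0]  # C, E, G
a_minor = [1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0]  # C, E, A
a_minor_missing_a = [1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0]

target = [c_major, a_minor]
predicted = [c_major, a_minor_missing_a]
print(element_accuracy(predicted, target), row_accuracy(predicted, target))
```

Here a single missed note leaves 23 of 24 entries correct at the element level, but only one of the two chords correct at the row level, so row accuracy is always the harsher number.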
Interestingly, we find good performance when the training data contains seventh chords and extensions, but the net is trained on outputs that include only the base triad (root, maj/min third, and fifth) or quadrad (root, maj/min third, fifth, and maj/min/missing seventh).
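One way to picture this setup: the audio contains every chord tone, while the label keeps only the base tones. The sketch below is an illustration under that reading, not the project's code; the names and the `level` parameter are hypothetical.

```python
THIRDS = {"maj": 4, "min": 3}      # semitones above the root
SEVENTHS = {"maj": 11, "min": 10}  # semitones above the root


def sounding_pitch_classes(root_pc, third, seventh=None, extensions=()):
    """Pitch classes present in the audio, including seventh and extensions."""
    tones = {0, THIRDS[third], 7}
    if seventh is not None:
        tones.add(SEVENTHS[seventh])
    tones.update(extensions)
    return sorted({(root_pc + t) % 12 for t in tones})


def target_pitch_classes(root_pc, third, seventh=None, level="triad"):
    """Pitch classes kept in the training label: triad, or triad plus seventh."""
    tones = {0, THIRDS[third], 7}
    if level == "quadrad" and seventh is not None:
        tones.add(SEVENTHS[seventh])
    return sorted({(root_pc + t) % 12 for t in tones})


# C major seventh with an added ninth (14 semitones above the root).
audio = sounding_pitch_classes(0, "maj", "maj", extensions=(14,))
triad = target_pitch_classes(0, "maj", "maj", level="triad")
quadrad = target_pitch_classes(0, "maj", "maj", level="quadrad")
```

With these definitions the net hears C, D, E, G, and B, but is only asked to report C, E, and G (triad) or C, E, G, and B (quadrad).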
To extrapolate to real-world situations, we also include a few real recordings of guitar chords: Emaj, Amaj, and a version of Emaj pitch-shifted up a semitone to Fmaj. We find that the current code correctly identifies all of their pitch classes except for one in the Amaj chord.
We can also run the net on slices of the song “Something” by The Beatles, and currently it produces garbage. We hope that by transferring our code to a server and greatly increasing the size and variety of the input data, we can improve performance. It would also behoove us to record more real-world examples to test on, although our current focus is on generating a greater variety of training data.