Automated Chorus Detection: Finding the Drops in Electronic Dance Music with Deep Learning

Dennis Dang | Jun 18, 2024

Introduction

As part of my ongoing quest to create an AI-powered DJ (a.k.a. Project Mixin), I needed a way to automatically identify the most engaging and memorable parts of songs - the choruses. Having an understanding of a song’s musical structure is crucial for DJs, whether human or machine, as it allows them to strategically select and transition between the most impactful sections.

You can find the relevant resources for this project on GitHub and try out the Streamlit app on YouTube songs via Hugging Face.

Now let’s dive into how I developed a Convolutional Recurrent Neural Network (CRNN) to detect “drops” in electronic dance music (EDM)!

Dataset

To train the CRNN, I compiled a custom dataset of 332 songs, primarily sourced from popular EDM playlists on Spotify. My goal was to create a model that could accurately detect choruses in the types of music I enjoy listening to and DJing with. Existing annotated music datasets didn’t offer the relevance or quality of data I required for this task.

Drawing from my research background, I created a data labeling protocol to ensure consistency and rigor in the manual labeling process. For each song in the dataset, I labeled the start and end timestamps of the choruses, following the guidelines I had defined. Broadly, I defined a chorus as a core thematic segment that is distinct, thematically representative of the song, and aligns with the music structure. Additionally, the chorus must start on the first downbeat of a given bar, and end on the last downbeat of a bar.
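Because choruses are constrained to bar boundaries, the timestamp labels map naturally onto per-bar binary targets. Here’s a minimal sketch of that conversion — the `bar_labels` helper, the fixed tempo, and the 4/4 meter are illustrative assumptions, not the exact pipeline:

```python
def bar_labels(duration_s, chorus_spans, bpm, beats_per_bar=4):
    """Label each bar 1 if it lies entirely inside a chorus span, else 0.

    Hypothetical helper: assumes a constant tempo and 4/4 meter.
    chorus_spans is a list of (start_s, end_s) timestamp pairs.
    """
    bar_len = beats_per_bar * 60.0 / bpm          # seconds per bar
    n_bars = int(duration_s // bar_len)
    labels = []
    for i in range(n_bars):
        start, end = i * bar_len, (i + 1) * bar_len
        in_chorus = any(s <= start and end <= e for s, e in chorus_spans)
        labels.append(1 if in_chorus else 0)
    return labels
```

For a 16-second clip at 120 BPM with one chorus from 4 s to 12 s, `bar_labels(16.0, [(4.0, 12.0)], bpm=120)` returns `[0, 0, 1, 1, 1, 1, 0, 0]`.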

Audio Preprocessing

Before training the model, the raw audio files needed to be preprocessed and transformed into a suitable format. This involved:

  1. Standardizing the audio format to mp3
  2. Trimming leading and trailing silence
  3. Downsampling the audio from 48 kHz to 12 kHz
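The trimming and downsampling steps can be sketched with NumPy and SciPy — a simplified stand-in for the actual pipeline, with the −40 dB silence threshold and the simple amplitude gate as assumptions:

```python
from math import gcd

import numpy as np
from scipy.signal import resample_poly

def trim_and_downsample(y, sr=48_000, target_sr=12_000, silence_db=-40.0):
    """Trim leading/trailing silence, then resample (48 kHz -> 12 kHz here).

    Sketch only: real pipelines often use a windowed energy measure
    rather than a raw per-sample amplitude gate.
    """
    thresh = 10 ** (silence_db / 20.0)            # -40 dB -> amplitude 0.01
    loud = np.flatnonzero(np.abs(y) > thresh)
    if loud.size:
        y = y[loud[0]:loud[-1] + 1]               # keep first..last loud sample
    g = gcd(target_sr, sr)
    return resample_poly(y, target_sr // g, sr // g), target_sr
```

`resample_poly` applies an anti-aliasing filter as part of the rational-rate conversion, which a naive `y[::4]` slice would not.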

Feature Extraction and Model Preprocessing

Part of my exploratory analysis involved visualizing various audio features over the course of each song.

Here are some visualizations of the extracted audio features, with the chorus sections highlighted in green:

Audio Features
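The spectrogram-style features in these plots all start from the same front end: frame the waveform, window each frame, take an FFT, and move to a log scale. A bare-bones NumPy version of that front end (the frame and hop sizes here are illustrative, not the ones I used):

```python
import numpy as np

def log_power_spectrogram(y, n_fft=1024, hop=256):
    """Frame, window, FFT, and log-scale a mono signal.

    Returns an array of shape (n_frames, n_fft // 2 + 1) — the basic
    representation that mel spectrograms and similar features build on.
    """
    window = np.hanning(n_fft)
    n_frames = 1 + (len(y) - n_fft) // hop
    frames = np.stack([y[i * hop:i * hop + n_fft] * window
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    return 10 * np.log10(power + 1e-10)           # dB-like log power
```

In practice a library like librosa handles this (plus the mel filterbank) in one call; the sketch just shows what those calls compute.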

Model Architecture

The CRNN model architecture consists of:

  • Three 1D convolutional layers to extract local musical patterns
  • A bidirectional LSTM layer to capture long-term temporal dependencies
  • A time-distributed dense output layer for meter-wise chorus predictions
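In PyTorch terms, a network with that shape looks roughly like the sketch below — the layer widths, the 128-bin feature input, and the sigmoid head are assumptions, not my exact implementation:

```python
import torch
import torch.nn as nn

class ChorusCRNN(nn.Module):
    """Sketch of a conv -> BiLSTM -> per-step dense chorus detector."""

    def __init__(self, n_features=128, hidden=64):
        super().__init__()
        # Three 1D convolutions over time extract local musical patterns.
        self.conv = nn.Sequential(
            nn.Conv1d(n_features, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(64, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(64, 64, kernel_size=3, padding=1), nn.ReLU(),
        )
        # Bidirectional LSTM captures long-range temporal dependencies.
        self.lstm = nn.LSTM(64, hidden, batch_first=True, bidirectional=True)
        # A Linear applied to each timestep acts as a time-distributed dense layer.
        self.head = nn.Linear(2 * hidden, 1)

    def forward(self, x):                          # x: (batch, time, features)
        x = self.conv(x.transpose(1, 2)).transpose(1, 2)
        x, _ = self.lstm(x)
        return torch.sigmoid(self.head(x)).squeeze(-1)  # per-step chorus prob
```

The output is one probability per timestep, which can then be thresholded and snapped to bar boundaries.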

Results

After training for 18 epochs, the model achieved the following performance on the test set:

| Metric    | Score |
|-----------|-------|
| Accuracy  | 0.891 |
| Precision | 0.831 |
| Recall    | 0.900 |
| F1        | 0.864 |
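As a quick sanity check, the reported F1 is consistent with the precision and recall above, since F1 is their harmonic mean:

```python
precision, recall = 0.831, 0.900
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 3))  # 0.864, matching the table
```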

Confusion Matrix

The model generalizes reasonably well to songs from similar EDM genres as the training data. Here’s an example prediction, with the detected choruses highlighted:

Chorus Prediction

Try It Yourself!

Want to see how well the model performs on your favorite tracks? You can try out the chorus detection model in two ways:

  1. Use the web demo on Hugging Face
  2. Run the Dockerized command-line tool locally

What’s Next?

With the ability to pinpoint the choruses, the next step in Project Mixin is to develop an intelligent song selection and transition system. The goal is to create an AI DJ that can read the vibe of a crowd and mix the best parts of tracks together in a musically and emotionally cohesive way. Stay tuned for updates!

In the meantime, I’d love to hear your thoughts on this chorus detection model. Feel free to leave a comment below or reach out on GitHub with any questions or feedback. And if you found this post interesting, consider giving the repo a ⭐ star to show your support!