Introduction
As part of my ongoing quest to create an AI-powered DJ (a.k.a. Project Mixin), I needed a way to automatically identify the most engaging and memorable parts of songs: the choruses. Understanding a song’s musical structure is crucial for DJs, whether human or machine, as it allows them to strategically select and transition between the most impactful sections.
You can find the code and related resources for this project on GitHub, and try out the Streamlit app on YouTube songs via Hugging Face.
Now let’s dive into how I developed a Convolutional Recurrent Neural Network (CRNN) to detect choruses (the “drops”) in electronic dance music (EDM)!
Dataset
To train the CRNN, I compiled a custom dataset of 332 songs, primarily sourced from popular EDM playlists on Spotify. My goal was to create a model that could accurately detect choruses in the types of music I enjoy listening to and DJing with. Existing annotated music datasets didn’t offer the relevance or quality of data I required for this task.
Drawing from my research background, I created a data labeling protocol to ensure consistency and rigor in the manual labeling process. For each song in the dataset, I labeled the start and end timestamps of the choruses, following the guidelines I had defined. Broadly, I defined a chorus as a core thematic segment that is distinct, thematically representative of the song, and aligned with the song’s musical structure. Additionally, the chorus must start on the first downbeat of a bar and end on the last downbeat of a bar.
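To give a concrete sense of how such annotations feed into training, here is a minimal sketch of converting labeled chorus spans into frame-level binary targets. The span format (a list of `(start_sec, end_sec)` pairs) and the helper name are my own illustrative assumptions, not the project’s actual label schema:

```python
def labels_to_frames(chorus_spans, n_frames, frame_hop_sec):
    """Convert (start_sec, end_sec) chorus spans into a 0/1 target per frame.

    Note: span format and function name are hypothetical, for illustration.
    """
    targets = [0] * n_frames
    for start_sec, end_sec in chorus_spans:
        start_frame = int(start_sec / frame_hop_sec)
        end_frame = min(int(end_sec / frame_hop_sec), n_frames)
        for i in range(start_frame, end_frame):
            targets[i] = 1  # frame falls inside a labeled chorus
    return targets
```

For example, a chorus from 1.0 s to 2.0 s with a 0.5 s frame hop marks frames 2 and 3 as positive.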
Audio Preprocessing
Before training the model, the raw audio files needed to be preprocessed and transformed into a suitable format. This involved:
- Standardizing the audio format to mp3
- Trimming leading and trailing silence
- Downsampling the audio from 48 kHz to 12 kHz
Feature Extraction and Model Preprocessing
Part of my exploratory analysis involved visualizing various audio features over time to see how they behave around chorus boundaries. Here are some visualizations of the extracted audio features, with the chorus sections highlighted in green:

Model Architecture
The CRNN model architecture consists of:
- Three 1D convolutional layers to extract local musical patterns
- A bidirectional LSTM layer to capture long-term temporal dependencies
- A time-distributed dense output layer for meter-wise chorus predictions
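The architecture above can be sketched in Keras as follows. The layer order mirrors the list, but the filter counts, kernel sizes, pooling, and LSTM width are assumed hyperparameters, not the model's actual configuration:

```python
from tensorflow import keras
from tensorflow.keras import layers


def build_crnn(n_frames, n_features):
    """CRNN sketch: conv stack -> BiLSTM -> time-distributed sigmoid head.

    Note: all hyperparameters here are illustrative assumptions.
    """
    inputs = keras.Input(shape=(n_frames, n_features))
    x = inputs
    # Three 1D conv layers extract local musical patterns;
    # pooling halves the time resolution at each stage
    for filters in (64, 128, 256):
        x = layers.Conv1D(filters, 3, padding="same", activation="relu")(x)
        x = layers.MaxPooling1D(2, padding="same")(x)
    # Bidirectional LSTM captures long-range temporal dependencies
    x = layers.Bidirectional(layers.LSTM(128, return_sequences=True))(x)
    # Per-timestep chorus probability
    outputs = layers.TimeDistributed(layers.Dense(1, activation="sigmoid"))(x)
    return keras.Model(inputs, outputs)
```

With three pooling stages, an input of 1,024 frames yields 128 output timesteps, each scored with a chorus probability.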
Results
After training for 18 epochs, the model achieved the following performance on the test set:
| Metric | Score |
|---|---|
| Accuracy | 0.891 |
| Precision | 0.831 |
| Recall | 0.900 |
| F1 | 0.864 |
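As a quick sanity check, the reported F1 is consistent with the precision and recall above, since F1 is their harmonic mean:

```python
precision, recall = 0.831, 0.900

# F1 is the harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 3))  # 0.864, matching the table
```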

The model generalizes reasonably well to songs from similar EDM genres as the training data. Here’s an example prediction, with the detected choruses highlighted:

Try It Yourself!
Want to see how well the model performs on your favorite tracks? You can try out the chorus detection model in two ways:
- Use the web demo on Hugging Face
- Run the Dockerized command-line tool locally
What’s Next?
With the ability to pinpoint the choruses, the next step in Project Mixin is to develop an intelligent song selection and transition system. The goal is to create an AI DJ that can read the vibe of a crowd and mix the best parts of tracks together in a musically and emotionally cohesive way. Stay tuned for updates!
In the meantime, I’d love to hear your thoughts on this chorus detection model. Feel free to leave a comment below or reach out on GitHub with any questions or feedback. And if you found this post interesting, consider giving the repo a ⭐ star to show your support!