
StemGen: A music generation model that listens

Julian Parker, Janne Spijkervet, Katerina Kosta, Furkan Yesiler, Boris Kuznetsov, Ju-Chiang Wang, Matt Avent, Jitong Chen, Duc Le

arXiv

Accepted at ICASSP 2024


Overview

StemGen is an end-to-end music generation model, trained to listen to musical context and respond appropriately. It's built on a non-autoregressive, language-model-style architecture similar to SoundStorm and VampNet. More details are available in the paper.

This page presents a number of example outputs from models of this architecture.
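For readers unfamiliar with this family of models, the sketch below illustrates the general idea of non-autoregressive, MaskGIT/SoundStorm-style iterative unmasking over audio tokens, conditioned on context tokens. It is a minimal illustration of the technique only, not StemGen's actual implementation: the `model` interface, the cosine schedule and all names are assumptions.

```python
# Minimal sketch of MaskGIT / SoundStorm-style iterative unmasking over audio
# tokens, conditioned on context tokens. `model` is a hypothetical transformer
# returning per-position logits; this is NOT the actual StemGen implementation.
import math
import torch

def generate_stem_tokens(model, context_tokens, seq_len, mask_id, n_steps=8):
    """Start fully masked, then fill in tokens over `n_steps` parallel passes."""
    device = context_tokens.device
    tokens = torch.full((1, seq_len), mask_id, dtype=torch.long, device=device)

    for step in range(n_steps):
        # Cosine schedule: fraction of positions still masked after this step.
        frac_masked = math.cos(math.pi / 2 * (step + 1) / n_steps)
        n_remask = int(frac_masked * seq_len)

        # Predict every position in parallel (non-autoregressive decoding).
        logits = model(context_tokens, tokens)                  # (1, seq_len, vocab)
        probs = logits.softmax(dim=-1)[0]                       # (seq_len, vocab)
        sampled = torch.multinomial(probs, num_samples=1).squeeze(-1)
        confidence = probs.gather(-1, sampled.unsqueeze(-1)).squeeze(-1)

        # Already-fixed positions keep their tokens and are never re-masked.
        still_masked = tokens[0] == mask_id
        confidence = torch.where(still_masked, confidence,
                                 torch.full_like(confidence, float("inf")))
        tokens[0] = torch.where(still_masked, sampled, tokens[0])

        # Re-mask the least-confident positions according to the schedule.
        tokens[0, confidence.argsort()[:n_remask]] = mask_id

    return tokens
```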

Models / datasets

We present examples from three different models here:

| Name | Dataset | Conditioning | Tokenizer | Params |
|---|---|---|---|---|
| slakh | Slakh2100 | Target instrument category | 32 kHz Encodec | ~250M |
| internal | Internal dataset of 500 hours of human-played music, available as individual instrument stems | Target instrument category | 32 kHz Encodec | ~250M |
| mingus * | Pretrained on 500 hours of synthetic data from an internal symbolic music generation model, then fine-tuned on 2 hours of high-quality, human-composed and produced music | Genre category, target stem category | Stereo 48 kHz Encodec | ~250M |

* not presented in paper

Test set examples

These examples are produced by constructing context audio from the test dataset partition, and using a model to generate a single stem (conditioning chosen at random) in response. They therefore closely reflect the task presented to the model at training time.
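As a rough summary of this protocol, the sketch below mixes a test item's existing stems as context, picks a target category at random, and generates one stem in response. Everything here (`model`, `codec`, `test_item`, the category list) is a hypothetical placeholder used only to illustrate the procedure, not the real evaluation code.

```python
# Hedged sketch of the test-set procedure described above. `model`, `codec`
# and `test_item` are hypothetical placeholders.
import random

CATEGORIES = ["bass", "drums", "guitar", "piano", "strings", "synth"]  # illustrative labels

def test_set_example(model, codec, test_item):
    context_audio = sum(test_item.stems.values())        # context built from existing stems
    target_category = random.choice(CATEGORIES)          # conditioning chosen at random
    stem_tokens = model.generate(codec.encode(context_audio), category=target_category)
    generated_stem = codec.decode(stem_tokens)
    mixed = context_audio + generated_stem               # corresponds to the "Mixed" audio below
    return context_audio, generated_stem, mixed
```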

| Model | Target category | Context | Generated stem | Mixed |
|---|---|---|---|---|
| slakh | Guitar | (audio) | (audio) | (audio) |
| slakh | Guitar | (audio) | (audio) | (audio) |
| slakh | Synth | (audio) | (audio) | (audio) |
| slakh | Drums | (audio) | (audio) | (audio) |
| slakh | Drums | (audio) | (audio) | (audio) |
| slakh | Drums | (audio) | (audio) | (audio) |
| internal | Guitar | (audio) | (audio) | (audio) |
| internal | Drums | (audio) | (audio) | (audio) |

Iterative generation examples

These examples are produced by providing the models with an arbitrary piece of context audio as a starting point. A new stem is generated from that context, and mixed with the existing audio. This new mixed audio is used as the context to generate another stem, and the process repeats. These examples therefore represent a much more challenging situation for the models, as they need to listen to and interpret both out-of-distribution audio and their own output.

These examples reflect a more typical use case of StemGen, with a user constructing music iteratively in a chat-like environment. To preserve how this interactive process unfolded, we also show situations where the user generated multiple variations of a particular stem. These are denoted as variations of the iteration; when one is chosen for further iteration, it is marked in italics.
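As a rough sketch of the loop described above (using the same hypothetical `model`/`codec` placeholders as before), each iteration sends the current mix as context and folds the returned stem back into it:

```python
# Hedged sketch of iterative generation: the current mix is the context for the
# next request, and each generated stem is mixed back in. `model` and `codec`
# are hypothetical placeholders, not the real API.
def iterate(model, codec, starting_audio, requests):
    """`requests` is a list of conditioning labels, e.g. ["bass", "piano", "drums"]."""
    context = starting_audio
    stems = []
    for category in requests:
        stem = codec.decode(model.generate(codec.encode(context), category=category))
        stems.append(stem)
        context = context + stem      # the new mix becomes the next iteration's context
    return context, stems
```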

Starting from drums

In this example, we use a short drum loop composed by the authors in Ableton Live as a starting point.

This example is chosen to demonstrate how the models follow rhythm, and also how they can generate coherent harmonic and melodic elements even when no harmonic or melodic information is provided at the start of the process.

slakh

| Iteration | Conditioning | Context | Generated stem | Mixed |
|---|---|---|---|---|
| 1 | Bass | (audio) | (audio) | (audio) |
| 2 (var. 1) | Piano | (audio) | (audio) | (audio) |
| 2 (var. 2) | Piano | (audio) | (audio) | (audio) |
| 2 (var. 3) | Woodwind | (audio) | (audio) | (audio) |

internal

Italic denotes which variation was used to continue generation

| Iteration | Conditioning | Context | Generated stem | Mixed |
|---|---|---|---|---|
| 1 (var. 1) | Guitar | (audio) | (audio) | (audio) |
| 1 (var. 2) | Guitar | (audio) | (audio) | (audio) |
| 2 | Percussion | (audio) | (audio) | (audio) |
| 3 | Bass | (audio) | (audio) | (audio) |

mingus

| Iteration | Conditioning | Context | Generated stem | Mixed |
|---|---|---|---|---|
| 1 | Electronic, Melodic | (audio) | (audio) | (audio) |
| 2 | Electronic, Harmonic | (audio) | (audio) | (audio) |
| 3 | Electronic, Harmonic | (audio) | (audio) | (audio) |
| 4 | Electronic, Percussive | (audio) | (audio) | (audio) |

Starting from chords

In this example, we use a short synth chord sequence composed by the authors in Ableton Live as a starting point.

This example is intended to show how the models can respond to harmony in a musically plausible way.

slakh

| Iteration | Conditioning | Context | Generated stem | Mixed |
|---|---|---|---|---|
| 1 | Piano | (audio) | (audio) | (audio) |
| 2 | Drums | (audio) | (audio) | (audio) |

internal

| Iteration | Conditioning | Context | Generated stem | Mixed |
|---|---|---|---|---|
| 1 | Guitar | (audio) | (audio) | (audio) |
| 2 | Guitar | (audio) | (audio) | (audio) |
| 3 | Bass | (audio) | (audio) | (audio) |

Deep iterative layering

In this example we start generation from silence, and repeatedly ask the slakh model to generate a piano stem, which is layered on top of the existing context (somewhat inspired by Alvin Lucier). We go through 9 iterations, after which we ask the model to generate a string stem and a bass stem. This example demonstrates the ability of the model to sensitively add musical content even when presented with a musically dense input, whilst maintaining rhythmic and harmonic coherence.

In this audio example the iterations are presented sequentially over time. The raw iterations and stems are available here.
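In terms of the hypothetical `iterate` sketch shown earlier, this procedure is simply a fixed request schedule applied to a buffer of silence. The duration, sample rate and object names below are illustrative assumptions, not values from the actual experiment.

```python
# Reusing the hypothetical `iterate` helper sketched earlier: nine piano
# requests, then strings, then bass, starting from silence. All values here
# (duration, sample rate, model/codec objects) are illustrative assumptions.
import numpy as np

silence = np.zeros(30 * 32000, dtype=np.float32)          # e.g. ~30 s at 32 kHz
schedule = ["piano"] * 9 + ["strings", "bass"]
final_mix, stems = iterate(slakh_model, codec, silence, schedule)
```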

Live interactive music generation demos

In this example we built a prototype of a real-time musical performance device, based on the mingus StemGen model. The application allows looping of 4 channels of audio, with the ability to apply reverb, delay and a DJ-style lowpass/highpass filter to each channel. Each channel has a ‘generate’ button denoted by the robot icon, which provides the StemGen model with the current mixed loop as context and returns a stem of the desired genre and type.

Using this device, a user can build up a musical composition interactively in real-time by requesting new stems, blending them with existing content, and manipulating them.
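The sketch below is a loose illustration of the per-channel "generate" action described above: the current mix of all channel loops becomes the context, and the returned stem lands in the chosen channel. The `Channel` class, the `model`/`codec` objects and their keyword arguments are all assumptions, not the prototype's actual implementation.

```python
# Loose sketch of the per-channel "generate" action: mix all channel loops as
# context, request a stem of the chosen genre/category, and place it in the
# target channel. All names and arguments are assumptions.
import numpy as np
from dataclasses import dataclass
from typing import Optional

LOOP_SAMPLES = 8 * 48000                    # e.g. an 8-second loop at 48 kHz (illustrative)

@dataclass
class Channel:
    audio: Optional[np.ndarray] = None      # current loop content, or None if empty

def on_generate_pressed(channels, target_index, model, codec,
                        genre="Electronic", stem_category="Percussive"):
    """Use the current mixed loop as context and fill the chosen channel."""
    mix = sum((c.audio for c in channels if c.audio is not None),
              start=np.zeros(LOOP_SAMPLES, dtype=np.float32))
    stem_tokens = model.generate(codec.encode(mix), genre=genre, category=stem_category)
    channels[target_index].audio = codec.decode(stem_tokens)
```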

The videos below give two examples of improvisations with the device.

This real-time device was built with the additional collaboration of David Trevelyan, Gleb Mineev and Peter Glushkov.