
StemGen: A music generation model that listens

Julian Parker, Janne Spijkervet, Katerina Kosta, Furkan Yesiler, Boris Kuznetsov, Ju-Chiang Wang, Matt Avent, Jitong Chen, Duc Le

arXiv

Accepted at ICASSP 2024


Overview

StemGen is an end-to-end music generation model, trained to listen to musical context and respond appropriately. It's built on a non-autoregressive, language-model-style architecture similar to SoundStorm and VampNet. More details are available in the paper.

This page presents a number of example outputs from models of this architecture.
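For readers unfamiliar with this family of models, the sketch below illustrates the general idea of non-autoregressive, MaskGIT/SoundStorm-style iterative unmasking over audio tokens, conditioned on context tokens. It is a minimal illustration of the technique only, not StemGen's actual implementation: the `model` interface, the cosine schedule and all names are assumptions.

```python
# Minimal sketch of MaskGIT / SoundStorm-style iterative unmasking over audio
# tokens, conditioned on context tokens. `model` is a hypothetical transformer
# returning per-position logits; this is NOT the actual StemGen implementation.
import math
import torch

def generate_stem_tokens(model, context_tokens, seq_len, mask_id, n_steps=8):
    """Start fully masked, then fill in tokens over `n_steps` parallel passes."""
    device = context_tokens.device
    tokens = torch.full((1, seq_len), mask_id, dtype=torch.long, device=device)

    for step in range(n_steps):
        # Cosine schedule: fraction of positions still masked after this step.
        frac_masked = math.cos(math.pi / 2 * (step + 1) / n_steps)
        n_remask = int(frac_masked * seq_len)

        # Predict every position in parallel (non-autoregressive decoding).
        logits = model(context_tokens, tokens)                  # (1, seq_len, vocab)
        probs = logits.softmax(dim=-1)[0]                       # (seq_len, vocab)
        sampled = torch.multinomial(probs, num_samples=1).squeeze(-1)
        confidence = probs.gather(-1, sampled.unsqueeze(-1)).squeeze(-1)

        # Already-fixed positions keep their tokens and are never re-masked.
        still_masked = tokens[0] == mask_id
        confidence = torch.where(still_masked, confidence,
                                 torch.full_like(confidence, float("inf")))
        tokens[0] = torch.where(still_masked, sampled, tokens[0])

        # Re-mask the least-confident positions according to the schedule.
        tokens[0, confidence.argsort()[:n_remask]] = mask_id

    return tokens
```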

Models / datasets

We present examples from three different models here:

| Name | Dataset | Conditioning | Tokenizer | Params |
|---|---|---|---|---|
| slakh | Slakh2100 | Target instrument category | 32 kHz Encodec | ~250M |
| internal | Internal dataset of 500 hours of human-played music, available as individual instrument stems | Target instrument category | 32 kHz Encodec | ~250M |
| mingus * | Pretrained on 500 hours of synthetic data from an internal symbolic music generation model, then fine-tuned on 2 hours of high-quality, human-composed and produced music | Genre category, target stem category | Stereo 48 kHz Encodec | ~250M |

* not presented in paper

Test set examples

These examples are produced by constructing context audio from the test dataset partition, and using a model to generate a single stem (conditioning chosen at random) in response. They therefore closely reflect the task presented to the model at training time.
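As a rough summary of this protocol, the sketch below mixes a test item's existing stems as context, picks a target category at random, and generates one stem in response. Everything here (`model`, `codec`, `test_item`, the category list) is a hypothetical placeholder used only to illustrate the procedure, not the real evaluation code.

```python
# Hedged sketch of the test-set procedure described above. `model`, `codec`
# and `test_item` are hypothetical placeholders.
import random

CATEGORIES = ["bass", "drums", "guitar", "piano", "strings", "synth"]  # illustrative labels

def test_set_example(model, codec, test_item):
    context_audio = sum(test_item.stems.values())        # context built from existing stems
    target_category = random.choice(CATEGORIES)          # conditioning chosen at random
    stem_tokens = model.generate(codec.encode(context_audio), category=target_category)
    generated_stem = codec.decode(stem_tokens)
    mixed = context_audio + generated_stem               # corresponds to the "Mixed" audio below
    return context_audio, generated_stem, mixed
```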

| Model | Target category | Context | Generated stem | Mixed |
|---|---|---|---|---|
| slakh | Guitar | (audio) | (audio) | (audio) |
| slakh | Guitar | (audio) | (audio) | (audio) |
| slakh | Synth | (audio) | (audio) | (audio) |
| slakh | Drums | (audio) | (audio) | (audio) |
| slakh | Drums | (audio) | (audio) | (audio) |
| slakh | Drums | (audio) | (audio) | (audio) |
| internal | Guitar | (audio) | (audio) | (audio) |
| internal | Drums | (audio) | (audio) | (audio) |

Iterative generation examples

These examples are produced by providing the models with an arbitrary piece of context audio as a starting point. A new stem is generated from that context, and mixed with the existing audio. This new mixed audio is used as the context to generate another stem, and the process repeats. These examples therefore represent a much more challenging situation for the models, as they need to listen to and interpret both out-of-distribution audio and their own output.

These examples reflect a more typical use case of StemGen, with a user constructing music iteratively in a chat-like environment. To preserve how this interactive process unfolded, we also show situations where the user generated multiple variations of a particular stem. These are denoted as variations of the iteration; when one is chosen for further iteration, it is marked in italics.
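As a rough sketch of the loop described above (using the same hypothetical `model`/`codec` placeholders as before), each iteration sends the current mix as context and folds the returned stem back into it:

```python
# Hedged sketch of iterative generation: the current mix is the context for the
# next request, and each generated stem is mixed back in. `model` and `codec`
# are hypothetical placeholders, not the real API.
def iterate(model, codec, starting_audio, requests):
    """`requests` is a list of conditioning labels, e.g. ["bass", "piano", "drums"]."""
    context = starting_audio
    stems = []
    for category in requests:
        stem = codec.decode(model.generate(codec.encode(context), category=category))
        stems.append(stem)
        context = context + stem      # the new mix becomes the next iteration's context
    return context, stems
```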

Starting from drums

In this example, we use a short drum loop composed by the authors in Ableton Live as a starting point.

This example is chosen to demonstrate how the models follow rhythm, and also how they can generate coherent harmonic and melodic elements even when no harmonic or melodic information is provided at the start of the process.

slakh

| Iteration | Conditioning | Context | Generated stem | Mixed |
|---|---|---|---|---|
| 1 | Bass | (audio) | (audio) | (audio) |
| 2 (var. 1) | Piano | (audio) | (audio) | (audio) |
| 2 (var. 2) | Piano | (audio) | (audio) | (audio) |
| 2 (var. 3) | Woodwind | (audio) | (audio) | (audio) |

internal

Italic denotes which variation was used to continue generation

| Iteration | Conditioning | Context | Generated stem | Mixed |
|---|---|---|---|---|
| 1 (var. 1) | Guitar | (audio) | (audio) | (audio) |
| 1 (var. 2) | Guitar | (audio) | (audio) | (audio) |
| 2 | Percussion | (audio) | (audio) | (audio) |
| 3 | Bass | (audio) | (audio) | (audio) |

mingus

| Iteration | Conditioning | Context | Generated stem | Mixed |
|---|---|---|---|---|
| 1 | Electronic, Melodic | (audio) | (audio) | (audio) |
| 2 | Electronic, Harmonic | (audio) | (audio) | (audio) |
| 3 | Electronic, Harmonic | (audio) | (audio) | (audio) |
| 4 | Electronic, Percussive | (audio) | (audio) | (audio) |

Starting from chords

In this example, we use a short synth chord sequence composed by the authors in Ableton Live as a starting point.

This example is intended to show how the models can respond to harmony in a musically plausible way.

slakh

| Iteration | Conditioning | Context | Generated stem | Mixed |
|---|---|---|---|---|
| 1 | Piano | (audio) | (audio) | (audio) |
| 2 | Drums | (audio) | (audio) | (audio) |

internal

| Iteration | Conditioning | Context | Generated stem | Mixed |
|---|---|---|---|---|
| 1 | Guitar | (audio) | (audio) | (audio) |
| 2 | Guitar | (audio) | (audio) | (audio) |
| 3 | Bass | (audio) | (audio) | (audio) |

Deep iterative layering

In this example we start generation from silence, and repeatedly ask the slakh model to generate a piano stem, which is layered on top of the existing context (somewhat inspired by Alvin Lucier). We go through 9 iterations, after which we ask the model to generate a string stem and a bass stem. This example demonstrates the ability of the model to sensitively add musical content even when presented with a musically dense input, whilst maintaining rhythmic and harmonic coherence.

In this audio example the iterations are presented sequentially over time. The raw iterations and stems are available here.
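In terms of the hypothetical `iterate` sketch shown earlier, this procedure is simply a fixed request schedule applied to a buffer of silence. The duration, sample rate and object names below are illustrative assumptions, not values from the actual experiment.

```python
# Reusing the hypothetical `iterate` helper sketched earlier: nine piano
# requests, then strings, then bass, starting from silence. All values here
# (duration, sample rate, model/codec objects) are illustrative assumptions.
import numpy as np

silence = np.zeros(30 * 32000, dtype=np.float32)          # e.g. ~30 s at 32 kHz
schedule = ["piano"] * 9 + ["strings", "bass"]
final_mix, stems = iterate(slakh_model, codec, silence, schedule)
```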

Live interactive music generation demos

In this example we built a prototype of a real-time musical performance device, based on the mingus StemGen model. The application allows looping of 4 channels of audio, with the ability to apply reverb, delay and a DJ-style lowpass/highpass filter to each channel. Each channel has a ‘generate’ button denoted by the robot icon, which provides the StemGen model with the current mixed loop as context and returns a stem of the desired genre and type.

Using this device, a user can build up a musical composition interactively in real-time by requesting new stems, blending them with existing content, and manipulating them.
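The sketch below is a loose illustration of the per-channel "generate" action described above: the current mix of all channel loops becomes the context, and the returned stem lands in the chosen channel. The `Channel` class, the `model`/`codec` objects and their keyword arguments are all assumptions, not the prototype's actual implementation.

```python
# Loose sketch of the per-channel "generate" action: mix all channel loops as
# context, request a stem of the chosen genre/category, and place it in the
# target channel. All names and arguments are assumptions.
import numpy as np
from dataclasses import dataclass
from typing import Optional

LOOP_SAMPLES = 8 * 48000                    # e.g. an 8-second loop at 48 kHz (illustrative)

@dataclass
class Channel:
    audio: Optional[np.ndarray] = None      # current loop content, or None if empty

def on_generate_pressed(channels, target_index, model, codec,
                        genre="Electronic", stem_category="Percussive"):
    """Use the current mixed loop as context and fill the chosen channel."""
    mix = sum((c.audio for c in channels if c.audio is not None),
              start=np.zeros(LOOP_SAMPLES, dtype=np.float32))
    stem_tokens = model.generate(codec.encode(mix), genre=genre, category=stem_category)
    channels[target_index].audio = codec.decode(stem_tokens)
```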

The videos below give two examples of improvisations with the device.

This real-time device was built with the additional collaboration of David Trevelyan, Gleb Mineev and Peter Glushkov.