Efficient Fine-Grained Guidance for Diffusion Model Based Symbolic Music Generation

Tingyu Zhu*, Haoyu Liu*, Ziyu Wang, Zhimin Jiang, Zeyu Zheng

We introduce Fine-Grained Guidance (FGG), an efficient approach for symbolic music generation using diffusion models. Our method enhances guidance through:
  (1) Fine-grained conditioning during training,
  (2) Fine-grained control during the diffusion sampling process.
In particular, sampling control ensures tonal accuracy in every generated sample, allowing our model to produce music with high precision, consistent rhythmic patterns, and even stylistic variations that align with user intent.
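As a rough illustration of what the sampling control does, the sketch below is a simplified, hypothetical rendering (not our released code): it treats intermediate notes as integer MIDI pitches and projects any out-of-key pitch onto the nearest in-key one after every reverse diffusion step, so that no out-of-key note can survive into the final sample.

```python
import numpy as np

# Hypothetical illustration of per-step sampling control. For readability we
# treat the intermediate state as integer MIDI pitches; the actual method
# operates in the diffusion model's own representation.

C_MAJOR = {0, 2, 4, 5, 7, 9, 11}  # allowed pitch classes (example key)

def project_to_key(pitches: np.ndarray, allowed_pcs: set) -> np.ndarray:
    """Snap each MIDI pitch to the nearest pitch whose pitch class is allowed."""
    out = pitches.copy()
    for i, p in enumerate(pitches):
        if p % 12 not in allowed_pcs:
            # search outward for the closest in-key pitch
            for delta in (1, -1, 2, -2):
                if (p + delta) % 12 in allowed_pcs:
                    out[i] = p + delta
                    break
    return out

def sample_with_control(denoise_step, x_t, timesteps, allowed_pcs=C_MAJOR):
    """Run the reverse process, applying the projection after every step."""
    for t in timesteps:
        x_t = denoise_step(x_t, t)              # model's ordinary reverse step
        x_t = project_to_key(x_t, allowed_pcs)  # fine-grained sampling control
    return x_t
```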

1. Accompaniment Generation given Melody and Chord

We provide the model with the melody and chord as inputs, and it generates the accompaniment accordingly.

In each example, the left column displays the melody provided as input to the model. The right column showcases music samples generated by the model. The scores for the melody and accompaniment are provided in section 4.
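For intuition, the following hypothetical sketch shows one simple way the melody and chord conditions could be packed together with the accompaniment being denoised (the exact encoding in our model may differ): each condition is rendered as a piano-roll channel and stacked with the noisy accompaniment roll, so the denoiser sees the fine-grained condition at every step.

```python
import numpy as np

# Hypothetical conditioning layout, for illustration only.
N_PITCH, N_FRAMES = 128, 64                      # pitch range x time frames

melody_roll = np.zeros((N_PITCH, N_FRAMES))      # 1 where the melody sounds
chord_roll  = np.zeros((N_PITCH, N_FRAMES))      # 1 for each chord tone
noisy_acc   = np.random.randn(N_PITCH, N_FRAMES) # accompaniment being denoised

model_input = np.stack([noisy_acc, melody_roll, chord_roll])  # shape (3, 128, 64)
```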

Example 1

With the following melody as condition

Generated Accompaniments

Example 2

With the following melody as condition

Generated Accompaniments

Example 3

With the following melody as condition

Generated Accompaniments

Example 4

With the following melody as condition

Generated Accompaniments

2. Style-Controlled Music Generation

Our approach enables controllable stylization in music generation. Sampling control ensures that every generated note strictly adheres to the scale of the target musical style, allowing the model to generate music in specific styles, even ones that were not present in the training data.
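Concretely, the same projection used for key control can enforce any target pitch-class set. The snippet below sketches hypothetical scale definitions (the exact conventions are ours, chosen for illustration) that could be passed to a sampling-control routine such as the `sample_with_control` sketch above.

```python
# Hypothetical target scales for style-controlled sampling.
DORIAN_D      = {2, 4, 5, 7, 9, 11, 0}   # D Dorian: D E F G A B C
CHINESE_PENTA = {0, 2, 4, 7, 9}          # gong-mode pentatonic: C D E G A

# e.g. sample_with_control(denoise_step, x_T, timesteps, allowed_pcs=DORIAN_D)
```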

Below, we demonstrate several examples of style-controlled music generation for:

Dorian Mode

The following are two examples generated by our method.

Example 1

Example 2

Chinese Style

The following are two examples generated by our method.

Example 1

Example 2

3. Demonstrating the Effectiveness of Our Proposed Method by Comparison

We demonstrate the impact of sampling control on an accompaniment generation task, given a melody and chord progression. For each example, we generate accompaniments using the same random seed under different ablative conditions, so that the results are directly comparable.

The ablative conditions are as follows:

  1. Only Training Control
  2. Training Control + Remove Out-of-Key Notes at the Last Step
  3. Training Control + Round Out-of-Key Notes to the Nearest Allowable Note at the Last Step
  4. Inpainting Method

Comparing the results shows that sampling control not only eliminates out-of-key notes but also improves the overall coherence and harmonic consistency of the accompaniments, highlighting the effectiveness of our approach.
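To make the last-step baselines concrete, the hypothetical sketch below (illustrative helpers, not the paper's code) implements ablative conditions 2 and 3 as one-shot post-processing of the final sample. Because they touch only the final step, the model never gets a chance to repair the harmony around the altered notes, which is exactly what per-step sampling control provides.

```python
import numpy as np

C_MAJOR = {0, 2, 4, 5, 7, 9, 11}  # example key for the illustration

def remove_out_of_key(pitches, allowed_pcs=C_MAJOR):
    """Ablation 2: delete every note whose pitch class is out of key."""
    return np.array([p for p in pitches if p % 12 in allowed_pcs])

def round_out_of_key(pitches, allowed_pcs=C_MAJOR):
    """Ablation 3: shift each out-of-key note by one semitone into the key."""
    fixed = []
    for p in pitches:
        if p % 12 in allowed_pcs:
            fixed.append(p)
        else:
            fixed.append(p + 1 if (p + 1) % 12 in allowed_pcs else p - 1)
    return np.array(fixed)
```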

Example 1

With pre-defined melody and chord as follows

Music Sheet of Melody and Chord

Generated Accompaniments from Different Ablative Conditions

Training Control + Sampling Control (our proposed method)

Only Training Control

Training Control + Remove Out-of-Key Notes

Training Control + Round Out-of-Key Notes to Nearest

Inpainting Method

Example 2

With pre-defined melody and chord as follows

Music Sheet of Melody and Chord

Generated Accompaniments from Different Ablative Conditions

Training Control + Sampling Control (our proposed method)

Only Training Control

Training Control + Remove Out-of-Key Notes

Training Control + Round Out-of-Key Notes to Nearest

Inpainting Method

4. DIY in Real Time!

Try our interactive music generation tool, where you can generate new accompaniments for a given melody and chord condition. Visit our Hugging Face demo to experiment with the model.