ViTex: Visual Texture Control for Multi-track Symbolic Music Generation via Discrete Diffusion Models


Xiaoyu Yi, Qi He, Gus Xia, Ziyu Wang

We propose ViTex, a representation that can (1) intuitively visualize track-wise, texture-level instrumentation ideas and (2) serve as a conditioning signal to guide music generation. We further train a discrete diffusion model that takes a ViTex and a chord progression as inputs and generates multi-track symbolic music. Our model focuses on 8-bar pieces in 4/4 time. This demo page showcases the model's outputs under different generation settings, organized as follows:

  1. Conditional Generation Given ViTex and Chord Progression
    Given different ViTex and chord progressions as inputs, our model produces diverse generations.
  2. Prompt Continuation
    Building on the first setting, we additionally provide the first two bars as context. Using diffusion-based inpainting, the model generates the remaining six bars.
  3. Effect of Different Control Scales
    Since both the ViTex and chord progression conditions are trained with classifier-free guidance, we can independently adjust the control strengths $\lambda_{\text{ins}}$ (instrumentation) and $\lambda_{\text{chd}}$ (chord). This section demonstrates how varying the control scales influences generation.
  4. Unconditional Generation
    We also show samples generated without any conditioning, demonstrating the model's ability to produce musically coherent pieces on its own.

Our model also supports drum generation. On this demo page, the drums are generated using a fixed drum ViTex.

Conditional Generation Given ViTex and Chord Progression

Below is a demo. Click the left and right arrows to browse different ViTex and chord progressions. The displayed ViTex and chord progression are used as the model's inputs, with both control scales fixed at $\lambda_{\text{ins}} = \lambda_{\text{chd}} = 1.0$. The corresponding generated result is shown on the right.

[Interactive demo, 3 examples: ViTex (Instrumentation) + Chord Progression = Multi-track Music (audio and piano roll)]

Prompt Continuation

By leveraging diffusion-based inpainting, our model supports music continuation from a prompt. In the examples below, we feed the model the first two bars of the ground-truth piece, along with the corresponding ViTex and chord progression at control scales $\lambda_{\text{ins}} = \lambda_{\text{chd}} = 1.0$, and it generates the following six bars. We also compare our results with those produced by the Anticipatory Music Transformer (AMT) and the Multitrack Music Transformer (MMT). For AMT, we used its medium-sized checkpoint; for MMT, we used the first of its publicly available checkpoints.
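
One common way to realize inpainting with a diffusion sampler is to overwrite the known region with the prompt after every reverse step, so that only the remaining positions are actually generated. The sketch below illustrates that idea for the setting described above (8 bars, of which the first 2 are given). The function names, the flat token layout, and the resolution of 16 steps per bar are assumptions made for illustration; they are not taken from the model's code.

```python
def make_prompt_mask(n_bars=8, prompt_bars=2, steps_per_bar=16):
    """Return a list of booleans, True where a position belongs to the prompt.

    steps_per_bar=16 (a sixteenth-note grid in 4/4) is an assumed resolution.
    """
    known = prompt_bars * steps_per_bar
    return [t < known for t in range(n_bars * steps_per_bar)]


def clamp_to_prompt(sample, prompt, mask):
    """Overwrite the prompt region of the current sample with the known tokens.

    In an inpainting sampler this would be applied after each reverse step.
    """
    return [p if keep else s for s, p, keep in zip(sample, prompt, mask)]


mask = make_prompt_mask()
prompt = [1] * len(mask)   # stand-in tokens for the ground-truth piece
sample = [0] * len(mask)   # stand-in for the model's current sample
clamped = clamp_to_prompt(sample, prompt, mask)
```

Here the first 32 positions of `clamped` come from the prompt and the remaining 96 are left to the sampler.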

[Three examples, each with audio for the Given Prompt, Ground Truth, Ours, AMT, and MMT, plus a comparison piano roll]

Effect of Different Control Scales

Both the ViTex and chord progression conditions are trained using classifier-free guidance. We can independently adjust their control strengths, denoted as $\lambda_{\text{ins}}$ and $\lambda_{\text{chd}}$. For simplicity, we fix the ViTex and chord progression as follows:

[Figure: the fixed ViTex (Instrumentation) and Chord Progression]

We then vary the values of $\lambda_{\text{ins}}$ and $\lambda_{\text{chd}}$. The table below shows the model’s generated outputs under different control strengths. From left to right, as the chord control increases, the generated music increasingly aligns with the given chord progression (F, G, Em, Am, F, G, Em, Am). From top to bottom, as the instrumentation control increases, the instrumentation gradually shifts from random to conforming to the specified ViTex.

| | $\lambda_{\text{chd}} = 0.0$ | $\lambda_{\text{chd}} = 0.2$ | $\lambda_{\text{chd}} = 0.5$ | $\lambda_{\text{chd}} = 0.7$ |
| $\lambda_{\text{ins}} = 0.0$ | Vi1-Ch1 | Vi1-Ch2 | Vi1-Ch3 | Vi1-Ch4 |
| $\lambda_{\text{ins}} = 0.3$ | Vi2-Ch1 | Vi2-Ch2 | Vi2-Ch3 | Vi2-Ch4 |
| $\lambda_{\text{ins}} = 1.2$ | Vi3-Ch1 | Vi3-Ch2 | Vi3-Ch3 | Vi3-Ch4 |
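
With two conditions, classifier-free guidance can be combined in more than one way. A minimal sketch of one common formulation follows, assuming the model can produce per-token logits for an unconditional pass, a ViTex-only pass, and a chord-only pass: $\ell = \ell_{\varnothing} + \lambda_{\text{ins}}(\ell_{\text{ins}} - \ell_{\varnothing}) + \lambda_{\text{chd}}(\ell_{\text{chd}} - \ell_{\varnothing})$. The function `guided_logits`, the three-pass setup, and this particular combination rule are illustrative assumptions, not a description of the exact rule used by the model.

```python
import math


def guided_logits(uncond, ins_only, chd_only, lam_ins, lam_chd):
    """Combine three sets of logits using two independent guidance scales.

    Assumed rule: start from the unconditional logits and add each
    condition's offset, weighted by its own scale.
    """
    return [
        u + lam_ins * (i - u) + lam_chd * (c - u)
        for u, i, c in zip(uncond, ins_only, chd_only)
    ]


def softmax(logits):
    """Turn logits into a categorical distribution over tokens."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy logits over a two-token vocabulary (made-up values).
uncond = [0.0, 1.0]
ins_only = [2.0, 0.0]
chd_only = [1.0, 3.0]
probs = softmax(guided_logits(uncond, ins_only, chd_only, lam_ins=1.2, lam_chd=0.7))
```

Under this rule, setting both scales to zero recovers the unconditional logits, and raising one scale strengthens only the corresponding condition.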

Unconditional Generation

Below, we present additional samples of unconditional generation results, where both control strengths are set to zero (i.e., $\lambda_{\text{ins}} = \lambda_{\text{chd}} = 0.0$).

[Three unconditional samples, each with audio and piano roll]