Mozart's Touch: A Lightweight Multi-modal Music Generation Framework Based on Pre-Trained Large Models

Abstract

In recent years, AI-Generated Content (AIGC) has witnessed rapid advancements, facilitating the generation of music, images, and other forms of artistic expression across various industries. However, research on general multi-modal music generation models remains scarce. To fill this gap, we propose Mozart's Touch, a multi-modal music generation framework. It can generate music aligned with cross-modal inputs such as images, videos, and text. Mozart's Touch is composed of three main components: a Multi-modal Captioning Module, a Large Language Model (LLM) Understanding & Bridging Module, and a Music Generation Module. Unlike traditional approaches, Mozart's Touch requires no training or fine-tuning of pre-trained models, offering efficiency and transparency through clear, interpretable prompts. We also introduce the "LLM-Bridge" method to resolve the heterogeneous representation problem between descriptive texts of different modalities. We conduct a series of objective and subjective evaluations on the proposed model, and the results indicate that our model surpasses the performance of current state-of-the-art models. Our code and examples are available at: https://github.com/WangTooNaive/MozartsTouch.

Check out our paper on arXiv for more information.

Case Study

Original description:

a painting of a man with white hair and a black jacket, a portrait inspired, baroque, classic portrait, smug smirk, 18th century art

(Generated music: dull and tedious piano melody) ✗

Converted description:

A classical chamber piece with intricate melodies, rich harmonies, and elegant phrasing, embodying the sophistication of an 18th-century portrait.

(Generated music: classical and elegant piano melody) ✓

Original description:

a girl in a white dress standing in the water, an anime drawing, serial art, anime girl walking on water, in style of Japanese Anime, blue sea.

(Generated music: Japanese traditional melody) ✗

Converted description:

A serene piano melody accentuated with ethereal strings and delicate vocals, evoking a sense of beauty and surrealism, akin to Japanese emotional anime soundtracks.

(Generated music: peaceful and soft anime-like melody) ✓
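
The conversion shown in the two cases above can be sketched as a small pipeline: a captioner describes the input, an LLM rewrites that caption as a music description, and a text-to-music model consumes the result. The following is a minimal sketch under stated assumptions, not the authors' implementation. The prompt wording in `BRIDGE_TEMPLATE`, the `Pipeline` class, and the three callables are hypothetical placeholders; any real captioning model, LLM, or music generator would be plugged in behind them.

```python
from dataclasses import dataclass
from typing import Any, Callable

# Hypothetical prompt template. The exact wording used by Mozart's Touch
# may differ; this only illustrates the idea of an interpretable prompt.
BRIDGE_TEMPLATE = (
    "You are a music producer. Convert the following {modality} description "
    "into a one-sentence description of music that fits it. "
    "Describe genre, instruments, and mood; do not describe visual content.\n"
    "Description: {caption}\n"
    "Music description:"
)


def build_bridge_prompt(caption: str, modality: str = "image") -> str:
    """Wrap a raw caption in an instruction asking for a music description."""
    return BRIDGE_TEMPLATE.format(modality=modality, caption=caption.strip())


@dataclass
class Pipeline:
    """Three-stage sketch: captioning -> LLM bridging -> music generation.

    Each stage is an injected callable (an assumption of this sketch), so
    pre-trained models can be swapped in without any training step.
    """
    captioner: Callable[[Any], str]        # media -> descriptive caption
    llm: Callable[[str], str]              # prompt -> music description
    music_generator: Callable[[str], Any]  # music description -> audio

    def run(self, media: Any, modality: str = "image") -> Any:
        caption = self.captioner(media)
        music_description = self.llm(build_bridge_prompt(caption, modality)).strip()
        return self.music_generator(music_description)


if __name__ == "__main__":
    # Stub stages standing in for real models, for illustration only.
    demo = Pipeline(
        captioner=lambda media: "a painting of a man with white hair, baroque",
        llm=lambda prompt: "A classical chamber piece with elegant phrasing.",
        music_generator=lambda description: f"<audio for: {description}>",
    )
    print(demo.run(media=None))
```

Because the only thing passed between stages is plain text, every intermediate result (the caption, the prompt, and the converted music description) can be inspected directly, which is where the transparency mentioned in the abstract comes from.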

Samples: comparison to prior work

We compare Mozart's Touch with two prior works detailed in the paper: M2UGen and CoDi.
Examples are sampled from the same dataset, MUImage.

Columns: Picture, Ground Truth, Mozart's Touch, CoDi, M2UGen