ReelWave: Multi-Agentic Movie Sound Generation through Multimodal LLM Conversation
Abstract
Current audio generation conditioned on text or video focuses on aligning audio with the text/video modality. Despite excellent alignment results, these multimodal frameworks still cannot be directly applied to compelling movie storytelling involving multiple scenes, where "on-screen" sounds require temporally aligned audio generation, while "off-screen" sounds contribute appropriate environmental sounds, accompanied by background music when applicable. Inspired by professional movie production, this paper proposes a multi-agentic framework for audio generation supervised by an autonomous Sound Director agent, which engages in multi-turn conversations with other agents for on-screen and off-screen sound generation through a multimodal LLM. To address on-screen sound generation, after detecting any talking humans in videos, we capture semantically and temporally synchronized sound by training a prediction model that forecasts interpretable, time-varying audio control signals: loudness, pitch, and timbre, which are used by a Foley Artist agent to condition a cross-attention module in the sound generation. The Foley Artist works cooperatively with the Composer and Voice Actor agents, and together they autonomously generate off-screen sound to complement the overall production. Each agent takes on a specific role similar to those of a movie production team. To temporally ground audio language models, in ReelWave, text/video conditions are decomposed into atomic, specific sound generation instructions synchronized with visuals when applicable. Consequently, our framework can generate rich and relevant audio content conditioned on video clips extracted from movies.
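To make the idea of interpretable, time-varying control signals concrete, the following is a minimal sketch of how such curves might be measured from a reference waveform. It is not the prediction model described above: the function name `control_signals`, the frame and hop sizes, and the choice of frame RMS for loudness and spectral centroid as a crude brightness (timbre-like) proxy are all assumptions made for illustration, and a real pitch tracker is omitted.

```python
import numpy as np

def control_signals(wave, sr, frame=1024, hop=512):
    """Return per-frame loudness (RMS) and spectral centroid in Hz.

    Illustrative only: the feature definitions here are assumptions,
    not the definitions used by ReelWave.
    """
    wave = np.asarray(wave, dtype=np.float64)
    if len(wave) < frame:
        raise ValueError("waveform is shorter than one analysis frame")
    n_frames = 1 + (len(wave) - frame) // hop
    window = np.hanning(frame)                    # reduces spectral leakage
    freqs = np.fft.rfftfreq(frame, d=1.0 / sr)    # bin centre frequencies in Hz
    loudness = np.empty(n_frames)
    centroid = np.empty(n_frames)
    for i in range(n_frames):
        chunk = wave[i * hop : i * hop + frame]
        loudness[i] = np.sqrt(np.mean(chunk ** 2))            # frame RMS
        mag = np.abs(np.fft.rfft(chunk * window))
        centroid[i] = (freqs * mag).sum() / (mag.sum() + 1e-12)
    return loudness, centroid

# Usage: a one-second 440 Hz tone sampled at 16 kHz.
sr = 16000
t = np.arange(sr) / sr
tone = 0.5 * np.sin(2 * np.pi * 440.0 * t)
loudness, centroid = control_signals(tone, sr)
```

Each returned array has one value per analysis frame, so the curves can be aligned with video frames by resampling to the video frame rate.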
ReelWave is applied to generate sound for several scenarios. Given a short plot summary, our multi-agent framework generates sound effects, audible dialogue, and music that suit the plot.
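As a rough illustration of what "atomic, specific sound generation instructions" might look like as data, the sketch below represents each instruction as a time-stamped record assigned to one agent. The class `SoundInstruction`, its fields, the helper `group_by_agent`, and the example prompts are hypothetical and are not taken from the ReelWave implementation; only the agent role names come from the abstract.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass(frozen=True)
class SoundInstruction:
    agent: str        # role name, e.g. "Foley Artist", "Composer", "Voice Actor"
    prompt: str       # text prompt handed to that agent's generator
    start_s: float    # start time within the scene, in seconds
    end_s: float      # end time within the scene, in seconds
    on_screen: bool   # True if the sound must be synchronized with visuals

def group_by_agent(plan):
    """Group instructions by agent, each group ordered by start time."""
    grouped = defaultdict(list)
    for item in sorted(plan, key=lambda i: i.start_s):
        grouped[item.agent].append(item)
    return dict(grouped)

# Hypothetical plan for one scene (invented prompts and timings).
plan = [
    SoundInstruction("Composer", "tense low strings", 0.0, 8.0, False),
    SoundInstruction("Foley Artist", "door slams shut", 3.5, 4.0, True),
    SoundInstruction("Foley Artist", "footsteps on gravel", 0.5, 3.0, True),
    SoundInstruction("Voice Actor", "whispered warning", 5.0, 6.5, False),
]
by_agent = group_by_agent(plan)
```

Keeping the time span on every instruction is one simple way to let separately generated tracks be placed back on a shared scene timeline.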
Silent movies dubbing
Our method can be used to generate audio for silent movies, bringing them back to life.
Example 1
WAV Files
Scene 1
- Foley File
- Music File
- On-screen Sound File
Scene 2
- Foley File
- Music File
- Voice File
- On-screen Sound File
- Foley File
- On-screen Sound File
- Foley File
- On-screen Sound File
- Foley File
- Music File
- On-screen Sound File
- Foley File
- On-screen Sound File
- Foley File
- Music File
- On-screen Sound File
- Foley File
- On-screen Sound File
- Foley File
- Music File
- On-screen Sound File
- Foley File
- Music File
- Voice File
- On-screen Sound File
- Foley File
- Music File
- On-screen Sound File
- Foley File
- Music File
- On-screen Sound File
- Foley File
- Music File
- On-screen Sound File
- Foley File
- Music File
- On-screen Sound File
- Foley File
- Music File
- On-screen Sound File
- Foley File
- On-screen Sound File
- Foley File
- Music File
- On-screen Sound File
- Foley File
- On-screen Sound File
- Foley File
- Music File
- Voice File
- Foley File
- Music File
- Voice File
- Foley File
- Music File
- Voice File
Example 2
WAV Files
Scene 1
Scene 2
Anime dubbing
Our method can be used to generate audio for anime.
Example 1
WAV Files
Scene 1
Scene 2
Scene 3
Scene 4
Example 2
WAV Files
Scene 1
Scene 2
Sound Effect for Action Movie
Our method can be used to generate audio for live-action movie clips.
Example 1
WAV Files
Scene 1
Scene 2
Scene 3
Example 2
WAV Files
Scene 1
Scene 2
Scene 3
Example 3
WAV Files
Scene 1
Scene 2
Movie Sound Generation from Plot Summary
Our method can be used to generate audio from a movie plot description.
Example 1
WAV Files
Example 2
WAV Files
Example 3
WAV Files
On-screen audio generation from AVSync dataset
diff_foley | Im2Wav | Seeing_and_hearing | ReelWave |
---|---|---|---|