ReelWave: Multi-Agentic Movie Sound Generation through Multimodal LLM Conversation


Abstract

Current audio generation conditioned on text or video focuses on aligning audio with the text or video modality. Despite excellent alignment results, these multimodal frameworks cannot be directly applied to compelling movie storytelling involving multiple scenes, where "on-screen" sounds require temporally aligned audio generation, while "off-screen" sounds supply appropriate environmental sounds, accompanied by background music when applicable. Inspired by professional movie production, this paper proposes a multi-agentic framework for audio generation supervised by an autonomous Sound Director agent, which engages in multi-turn conversations with other agents for on-screen and off-screen sound generation through a multimodal LLM. To address on-screen sound generation, after detecting any talking humans in videos, we capture semantically and temporally synchronized sound by training a prediction model that forecasts interpretable, time-varying audio control signals: loudness, pitch, and timbre. A Foley Artist agent uses these signals to condition a cross-attention module in the sound generation. The Foley Artist works cooperatively with the Composer and Voice Actor agents, and together they autonomously generate off-screen sound to complement the overall production. Each agent takes on a specific role similar to that of a member of a movie production team. To temporally ground audio language models, ReelWave decomposes text/video conditions into atomic, specific sound generation instructions that are synchronized with visuals when applicable. Consequently, our framework can generate rich and relevant audio content conditioned on video clips extracted from movies.
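
To make the idea of a time-varying loudness control signal more concrete, the following is a minimal sketch that computes a frame-wise RMS loudness curve (in dB) from a mono waveform. It is an illustration only: the function name `frame_rms_db`, the frame and hop sizes, and the use of RMS as the loudness measure are all assumptions, not the definition used in ReelWave.

```python
import math

def frame_rms_db(samples, frame_len=1024, hop=512, eps=1e-10):
    """Return one RMS loudness value (in dB) per frame of a mono signal.

    Hypothetical helper: frame_len, hop, and the RMS measure are
    illustrative choices, not taken from the ReelWave paper.
    """
    values = []
    last_start = max(len(samples) - frame_len, 0)
    for start in range(0, last_start + 1, hop):
        frame = samples[start:start + frame_len]
        if not frame:
            break
        mean_sq = sum(x * x for x in frame) / len(frame)
        # 10*log10(mean square) equals 20*log10(RMS); eps avoids log(0).
        values.append(10.0 * math.log10(mean_sq + eps))
    return values

# Illustrative input: a 440 Hz tone at 16 kHz whose second half is 10x quieter.
sr = 16000
tone = [math.sin(2 * math.pi * 440 * n / sr) for n in range(sr)]
signal = [x if n < sr // 2 else 0.1 * x for n, x in enumerate(tone)]
curve = frame_rms_db(signal)
```

Here `curve` holds one value per frame, and the drop in amplitude halfway through the signal shows up as a drop of roughly 20 dB in the curve. A curve of this kind is the sort of interpretable, per-frame signal that could be predicted from video and passed to a generator as a condition.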



ReelWave is applied to generate sound for different scenarios. Our multi-agent framework takes a short plot summary description and generates sound effects, audible dialogues, and music that are suitable for the plot.






Silent Movie Dubbing

Our method can be used to generate audio for silent movies, bringing them back to life.

Example 1

WAV Files

Scene 1

  • Foley File
  • Music File
  • On-screen Sound File

Scene 2

  • Foley File
  • Music File
  • Voice File
  • On-screen Sound File

Example 2

WAV Files

Scene 1

  • Foley File
  • On-screen Sound File

Scene 2

  • Foley File
  • On-screen Sound File


Our method can be used to generate audio for anime.

Example 1

WAV Files

Scene 1

  • Foley File
  • Music File
  • On-screen Sound File

Scene 2

  • Foley File
  • On-screen Sound File

Scene 3

  • Foley File
  • Music File
  • On-screen Sound File

Scene 4

  • Foley File
  • On-screen Sound File

Example 2

WAV Files

Scene 1

  • Foley File
  • Music File
  • On-screen Sound File

Scene 2

  • Foley File
  • Music File
  • Voice File
  • On-screen Sound File


Sound Effect for Action Movie

Our method can be used to generate audio for live-action movie clips.

Example 1

WAV Files

Scene 1

  • Foley File
  • Music File
  • On-screen Sound File

Scene 2

  • Foley File
  • Music File
  • On-screen Sound File

Scene 3

  • Foley File
  • Music File
  • On-screen Sound File

Example 2

WAV Files

Scene 1

  • Foley File
  • Music File
  • On-screen Sound File

Scene 2

  • Foley File
  • Music File
  • On-screen Sound File

Scene 3

  • Foley File
  • On-screen Sound File

Example 3

WAV Files

Scene 1

  • Foley File
  • Music File
  • On-screen Sound File

Scene 2

  • Foley File
  • On-screen Sound File


Movie Sound Generation from Plot Summary

Our method can be used to generate audio given a movie plot description.

Example 1

WAV Files

  • Foley File
  • Music File
  • Voice File

Example 2

WAV Files

  • Foley File
  • Music File
  • Voice File

Example 3

WAV Files

  • Foley File
  • Music File
  • Voice File


On-screen Audio Generation from AVSync Dataset

Methods compared: diff_foley, Im2Wav, Seeing_and_hearing, ReelWave