MoonShot: Towards Controllable Video Generation and Editing with Multimodal Conditions

David Junhao Zhang, Dongxu Li, Hung Le, Mike Zheng Shou, Caiming Xiong, Doyen Sahoo

Salesforce Research    Show Lab, NUS

[arXiv]      [Code(coming soon)]


Most existing video diffusion models (VDMs) are limited to text conditions alone. As a result, they usually lack control over the visual appearance and geometric structure of the generated videos. This work presents MoonShot, a new video generation model that conditions simultaneously on multimodal inputs of image and text. The model builds upon a core module, called the multimodal video block (MVB), which consists of conventional spatiotemporal layers for representing video features and a decoupled cross-attention layer that handles image and text inputs for appearance conditioning. In addition, we carefully design the model architecture so that it can optionally integrate with pre-trained image ControlNet modules for geometric visual conditions, without the extra training overhead that prior methods require. Experiments show that, with versatile multimodal conditioning mechanisms, MoonShot demonstrates significant improvement in visual quality and temporal consistency compared to existing models. In addition, the model can be easily repurposed for a variety of generative applications, such as personalized video generation, image animation, and video editing, unveiling its potential to serve as a fundamental architecture for controllable video generation.
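To make the idea of a decoupled cross-attention layer more concrete, here is a minimal NumPy sketch. It is not the released MoonShot code: it assumes one common formulation in which the video features provide a single shared query, text and image tokens each get their own key and value projections, and the two attention outputs are summed. All names (`decoupled_cross_attention`, `w_k_img`, `image_scale`, and so on) and all shapes are hypothetical and chosen only for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attend(q, k, v):
    # Plain single-head scaled dot-product attention.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v


def decoupled_cross_attention(x, text_tokens, image_tokens, params, image_scale=1.0):
    # Assumption: the query projection is shared across both conditions.
    q = x @ params["w_q"]
    # Assumption: text and image use separate key/value projections.
    text_out = attend(q, text_tokens @ params["w_k_text"], text_tokens @ params["w_v_text"])
    image_out = attend(q, image_tokens @ params["w_k_img"], image_tokens @ params["w_v_img"])
    # Assumption: the two branches are combined by a weighted sum.
    return text_out + image_scale * image_out


# Toy inputs with random values; the sizes are illustrative only.
rng = np.random.default_rng(0)
dim = 8
names = ("w_q", "w_k_text", "w_v_text", "w_k_img", "w_v_img")
params = {name: rng.normal(size=(dim, dim)) * 0.1 for name in names}
x = rng.normal(size=(16, dim))            # 16 spatial positions of one frame
text_tokens = rng.normal(size=(5, dim))   # 5 text embedding tokens
image_tokens = rng.normal(size=(4, dim))  # 4 image embedding tokens

out = decoupled_cross_attention(x, text_tokens, image_tokens, params)
print(out.shape)
```

In this sketch, setting `image_scale=0.0` reduces the layer to ordinary text-only cross-attention, which is one reason a decoupled design can be convenient: the image branch can be added to, or dropped from, a text-conditioned model without touching the text path.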

Zero-Shot Subject Customized Video Generation


Text to Video Generation

An astronaut is walking on the moon
A panda standing on a surfboard in the ocean in sunset
Robotic eagle, 8k unreal engine render, wires and gears
Leica portrait of a gremlin skateboarding
A space probe zooms by, carrying scientific instruments, exploring uncharted interstellar regions
A disoriented astronaut, lost in a galaxy of swirling colors, floating in zero gravity
A painting of a french bulldog dog portrait in the style of vincent van gogh the starry night, thick paint, big brushstrokes
A cute and tiny frog commander inside the Space Shuttle's control cockpit

Directly Using Image ControlNet

Image Animation Comparisons

Video Editing Comparisons

Ablation for Multimodal Cross-Attn for Video Generation

Ablation for Multimodal Cross-Attn for Image Animation