MotionDirector: Motion Customization of Text-to-Video Diffusion Models

Rui Zhao1 Yuchao Gu1 Jay Zhangjie Wu1 David Junhao Zhang1 Jia-Wei Liu1
Weijia Wu3 Jussi Keppo2 Mike Zheng Shou1

1Show Lab, 2National University of Singapore, 3Zhejiang University

 

📖TL;DR: MotionDirector can customize text-to-video diffusion models to generate videos with desired motions.

 




Abstract

Large-scale pre-trained diffusion models have exhibited remarkable capabilities in diverse video generation. Given a set of video clips of the same motion concept, the task of Motion Customization is to adapt existing text-to-video diffusion models to generate videos with this motion. For example, generating a video with a car moving in a prescribed manner under specific camera movements to make a movie, or a video illustrating how a bear would lift weights to inspire creators. Adaptation methods have been developed for customizing appearance, such as subject or style, but remain unexplored for motion. It is straightforward to extend mainstream adaptation methods to motion customization, including full model tuning, parameter-efficient tuning of additional layers, and Low-Rank Adaptations (LoRAs). However, the motion concept learned by these methods is often coupled with the limited appearances in the training videos, making it difficult to generalize the customized motion to other appearances. To overcome this challenge, we propose MotionDirector, with a dual-path LoRA architecture that decouples the learning of appearance and motion. Further, we design a novel appearance-debiased temporal loss to mitigate the influence of appearance on the temporal training objective. Experimental results show that the proposed method can generate videos of diverse appearances for the customized motions. Our method also supports various downstream applications, such as mixing the appearance of one video with the motion of another, and animating a single image with customized motions.
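
For illustration, below is a minimal PyTorch sketch of what such an appearance-debiased temporal objective could look like. It is not the official implementation: the anchor-frame subtraction, the beta strength, and the function name are assumptions made for this sketch; the exact formulation is given in the paper.

# Hypothetical sketch of an appearance-debiased temporal loss (not the official code).
# Assumption: appearance bias is reduced by subtracting a scaled anchor-frame term
# from each frame's noise target and prediction before taking the MSE.
import torch
import torch.nn.functional as F

def appearance_debiased_loss(pred_noise, target_noise, anchor_idx=0, beta=0.5):
    """Debiased MSE over video frames.

    pred_noise, target_noise: tensors of shape (batch, frames, channels, height, width)
    anchor_idx: index of the frame used as the shared appearance anchor
    beta: assumed debiasing strength; beta = 0 recovers the plain MSE loss
    """
    scale = (beta ** 2 + 1) ** 0.5

    def debias(noise):
        anchor = noise[:, anchor_idx : anchor_idx + 1]  # (b, 1, c, h, w), broadcast over frames
        return scale * noise - beta * anchor

    return F.mse_loss(debias(pred_noise), debias(target_noise))

# usage: pred = torch.randn(2, 16, 4, 32, 32); tgt = torch.randn_like(pred)
#        loss = appearance_debiased_loss(pred, tgt)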

 


Method

MotionDirector uses a dual-path architecture to learn the appearances and motions of reference videos in a decoupled way. All pre-trained weights of the foundational text-to-video diffusion model remain fixed. In the training stage, the spatial LoRAs learn to fit the appearances of the reference videos, while the temporal LoRAs learn their motion dynamics. During inference, injecting only the trained temporal LoRAs into the foundation model enables it to generalize the learned motions to diverse appearances.
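
To make the dual-path idea concrete, the sketch below shows one way spatial and temporal LoRAs could be attached to a frozen UNet in PyTorch and separated into two groups by layer name. The layer-name keys ("attn1", "temp_attn"), the choice to wrap only the query projection, and the unet argument are illustrative assumptions, not the actual MotionDirector code.

# Illustrative sketch of dual-path LoRA injection (assumed layer names, not official code).
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wraps a frozen nn.Linear with a trainable low-rank residual."""
    def __init__(self, base: nn.Linear, rank: int = 4, alpha: float = 1.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)              # base weights stay fixed
        self.down = nn.Linear(base.in_features, rank, bias=False)
        self.up = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.up.weight)           # start as an identity update
        self.scale = alpha / rank
        self.enabled = True                      # allows dropping a path at inference

    def forward(self, x):
        out = self.base(x)
        if self.enabled:
            out = out + self.scale * self.up(self.down(x))
        return out

def inject_loras(unet: nn.Module, spatial_key="attn1", temporal_key="temp_attn", rank=4):
    """Wrap query projections of attention layers and route them into two LoRA groups."""
    spatial, temporal = [], []
    for name, module in list(unet.named_modules()):
        if not (hasattr(module, "to_q") and isinstance(module.to_q, nn.Linear)):
            continue
        wrapped = LoRALinear(module.to_q, rank=rank)
        module.to_q = wrapped
        (temporal if temporal_key in name else spatial).append(wrapped)
    return spatial, temporal

In this sketch, training would optimize both groups on the reference videos, while at inference the spatial LoRAs can be switched off (for lora in spatial: lora.enabled = False) so that only the learned motion is injected into the foundation model, mirroring the decoupling described above.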


Results


Decouple the appearances and motions!

(Row 1) Two reference videos are used to train the proposed MotionDirector, respectively. (Row 2) MotionDirector can generalize the learned motions to diverse appearances. (Row 3) MotionDirector can mix the learned motion and appearance from different videos to generate new videos. (Row 4) MotionDirector can animate a single image with learned motions.



Motion customization on multiple videos



Motion customization on a single video



More results




Bibtex


        @article{zhao2023motiondirector,
          title={MotionDirector: Motion Customization of Text-to-Video Diffusion Models},
          author={Zhao, Rui and Gu, Yuchao and Wu, Jay Zhangjie and Zhang, David Junhao and Liu, Jia-Wei and Wu, Weijia and Keppo, Jussi and Shou, Mike Zheng},
          journal={arXiv preprint arXiv:2310.08465},
          year={2023}
        }