Yiren Song, Cheng Liu, Weijia Mao, Mike Zheng Shou†
Show Lab, National University of Singapore
At the core of Mitty is a Wan 2.2–based Diffusion Transformer that uses video in-context learning: human demonstration videos are encoded as clean condition tokens, guiding noisy robot tokens via bidirectional attention to generate temporally aligned robot-arm executions.
Mitty builds on a Diffusion Transformer–based video generator and adopts a video in-context learning paradigm. Human demonstration tokens are concatenated with noisy robot latent tokens, with noise injected only into the robot branch; bidirectional attention lets information flow between them so the model learns to generate robot manipulation videos directly from human demonstrations.
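The in-context conditioning described above can be sketched as follows. This is a minimal, illustrative numpy version, not the Wan 2.2 architecture: token counts, the noise schedule, and the single-head attention are toy assumptions, and the real model uses multi-head Transformer blocks over video latents.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes for illustration (not from the paper).
T_h, T_r, d = 6, 6, 16          # human tokens, robot tokens, token dim

human_tokens = rng.standard_normal((T_h, d))    # clean condition tokens
robot_latents = rng.standard_normal((T_r, d))   # clean robot latents (training target)

# Noise is injected only into the robot branch.
t = 0.5                                         # toy noise level
noise = rng.standard_normal((T_r, d))
noisy_robot = np.sqrt(1 - t) * robot_latents + np.sqrt(t) * noise

# Concatenate along the sequence axis into one joint token stream.
seq = np.concatenate([human_tokens, noisy_robot], axis=0)   # (T_h + T_r, d)

def bidirectional_attention(x):
    """Single-head self-attention with a full (unmasked) attention matrix,
    so human and robot tokens attend to each other in both directions."""
    scale = 1.0 / np.sqrt(x.shape[-1])
    logits = x @ x.T * scale                    # (N, N), no causal mask
    weights = np.exp(logits - logits.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ x

out = bidirectional_attention(seq)
# Only the robot half of the output would be used to predict the denoising target.
robot_out = out[T_h:]
print(seq.shape, robot_out.shape)               # (12, 16) (6, 16)
```

In the actual model this full attention pattern is what lets clean demonstration tokens guide the noisy robot tokens at every denoising step, without any extra cross-attention module.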
We detect hands with Detectron2 and segment them using Segment Anything, followed by hand keypoint detection and inpainting to recover clean backgrounds. Using inverse kinematics, we map hand keypoints to robot arm poses and render them. Finally, a human-in-the-loop filtering process curates over 6,000 high-quality human-robot paired videos for training Mitty.
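The inverse-kinematics step above, which maps hand keypoints to robot arm poses, can be illustrated with a toy analytic solver. This is a 2-link planar arm with made-up link lengths, not the robot model used in the pipeline; the target point stands in for a detected wrist keypoint.

```python
import math

# Toy 2-link planar arm; link lengths are illustrative, not the real robot's.
L1, L2 = 0.3, 0.25

def ik_2link(x, y):
    """Analytic inverse kinematics for a 2-link planar arm (one elbow branch).
    Maps a 2D target point (e.g. a wrist keypoint) to joint angles."""
    r2 = x * x + y * y
    c2 = (r2 - L1 * L1 - L2 * L2) / (2 * L1 * L2)
    c2 = max(-1.0, min(1.0, c2))        # clamp for numerical safety
    theta2 = math.acos(c2)
    theta1 = math.atan2(y, x) - math.atan2(L2 * math.sin(theta2),
                                           L1 + L2 * math.cos(theta2))
    return theta1, theta2

def fk_2link(theta1, theta2):
    """Forward kinematics: joint angles back to the end-effector position."""
    x = L1 * math.cos(theta1) + L2 * math.cos(theta1 + theta2)
    y = L1 * math.sin(theta1) + L2 * math.sin(theta1 + theta2)
    return x, y

target = (0.35, 0.2)                    # hypothetical wrist keypoint, in meters
t1, t2 = ik_2link(*target)
x, y = fk_2link(t1, t2)
print(round(x, 4), round(y, 4))         # recovers the target point
```

Running forward kinematics on the recovered joint angles reproduces the target point, which is the basic consistency check such a mapping stage needs before the rendered robot video can be paired with the human demonstration.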
Masquerade's multi-stage pipeline is prone to compounding errors (e.g., joint detection, inpainting, and rendering failures).
State-of-the-art video editing baselines struggle to maintain the appearance and structural consistency of the robot arm throughout the sequence, even when given both a reference image and a human demonstration video as input.
We thank our collaborators and supporters for their valuable help. Special thanks to Danze Chen for his contributions to the Masquerade synthetic dataset.
@article{mitty2025,
title = {Mitty: Diffusion-based Human-to-Robot Video Generation},
author = {Yiren Song and Cheng Liu and Weijia Mao and Mike Zheng Shou},
year = {2025},
}