X-Humanoid: Robotize Human Videos to Generate Humanoid Videos at Scale

Pei Yang*, Hai Ci*, Yiren Song, and Mike Zheng Shou

Show Lab, National University of Singapore

X-Humanoid transforms human videos into humanoid videos, creating data for training humanoid robot policies and world models.

At the core of X-Humanoid is a Diffusion Transformer (DiT) for video-to-video editing, adapted from the powerful Wan 2.2 video generation model into a video-in, video-out architecture.

Paper teaser image

The model is finetuned on synthetic video pairs rendered in Unreal Engine. Each video pair contains a human and a humanoid performing synchronized motions in the same scene.

These paired videos are created using abundant community assets from the Fab marketplace. We align these assets so that the same animations can be played on different characters.

Methodology figure: paired data generation in Unreal Engine
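
To make the training format concrete, here is a minimal sketch that loads such pairs as tensors, assuming each sample is stored as two synchronized clips; the PairedHumanHumanoidDataset class, file layout, and normalization below are illustrative assumptions, not the released pipeline.

    import torchvision.io as tvio
    from torch.utils.data import Dataset

    class PairedHumanHumanoidDataset(Dataset):
        """Synchronized (human, humanoid) clips rendered from the same
        animation in the same scene (hypothetical storage layout)."""

        def __init__(self, pairs):
            # pairs: list of (human_video_path, humanoid_video_path) tuples
            self.pairs = pairs

        def __len__(self):
            return len(self.pairs)

        def __getitem__(self, idx):
            human_path, humanoid_path = self.pairs[idx]
            human, _, _ = tvio.read_video(human_path, pts_unit="sec")
            robot, _, _ = tvio.read_video(humanoid_path, pts_unit="sec")

            def to_model_format(v):
                # [T, H, W, C] uint8 -> [C, T, H, W] float in [-1, 1]
                return v.permute(3, 0, 1, 2).float() / 127.5 - 1.0

            return to_model_format(human), to_model_format(robot)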

To adapt Wan 2.2 for video-to-video generation, we encode the input video into non-denoising condition tokens and concatenate them with the denoising generation tokens. Only the generation tokens are denoised and then decoded to produce the edited video.

Methodology figure: video-to-video DiT architecture
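
A minimal sketch of this conditioning scheme with a generic transformer stand-in (the real backbone is the adapted Wan 2.2 DiT; the dimensions, the epsilon-prediction objective, and the omitted noise schedule are simplifying assumptions): condition and generation tokens attend to each other jointly, but the loss is computed only on the generation half.

    import torch
    import torch.nn as nn

    class Video2VideoDiT(nn.Module):
        """Toy stand-in for the adapted DiT: encoded input-video latents act as
        condition tokens, and only the generation tokens are denoised."""

        def __init__(self, dim=512, depth=4, heads=8):
            super().__init__()
            layer = nn.TransformerEncoderLayer(dim, heads, 4 * dim, batch_first=True)
            self.blocks = nn.TransformerEncoder(layer, depth)
            self.out = nn.Linear(dim, dim)  # noise-prediction head

        def forward(self, cond_tokens, noisy_gen_tokens):
            # cond_tokens:      [B, Nc, D] clean latents of the input (human) video
            # noisy_gen_tokens: [B, Ng, D] noised latents of the target (humanoid) video
            x = torch.cat([cond_tokens, noisy_gen_tokens], dim=1)  # joint attention
            x = self.blocks(x)
            # drop the condition half: those tokens are never denoised
            return self.out(x[:, cond_tokens.shape[1]:])

    # One training step (toy additive noising; a real diffusion schedule is omitted).
    model = Video2VideoDiT()
    cond = torch.randn(2, 256, 512)    # VAE-encoded input-video tokens
    target = torch.randn(2, 256, 512)  # VAE-encoded target-video tokens
    noise = torch.randn_like(target)
    loss = (model(cond, target + noise) - noise).pow(2).mean()
    loss.backward()

At inference time, the denoised generation tokens are passed to the video decoder to produce the humanoid video, as described above.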

X-Humanoid generates videos with the correct humanoid embodiment and with motions synchronized to the original video. In our user study, most participants preferred X-Humanoid over the baselines, rating it best on each of the following axes:

Best Motion Consistency: 69%
Best Background Consistency: 75%
Best Embodiment Consistency: 62%
Best Video Quality: 62%

X-Humanoid can also robotize a wide variety of Internet videos, enabling creative and diverse applications.
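
As an illustration of such a workflow, the snippet below robotizes a downloaded clip end to end; x_humanoid, XHumanoidPipeline, and the checkpoint id are hypothetical placeholder names rather than a released package or API.

    import torchvision.io as tvio
    from x_humanoid import XHumanoidPipeline  # hypothetical package and class

    video, _, info = tvio.read_video("dance.mp4", pts_unit="sec")  # any Internet clip
    pipe = XHumanoidPipeline.from_pretrained("x-humanoid-wan2.2")  # placeholder id
    robotized = pipe(video)  # same scene and motion, humanoid embodiment
    tvio.write_video("dance_humanoid.mp4", robotized, fps=round(info["video_fps"]))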

Acknowledgement

X-Humanoid benefited from valuable discussions with Cheng Liu and Rui Zhao.

@article{xhumanoid,
  title={X-Humanoid: Robotize Human Videos to Generate Humanoid Videos at Scale},
  author={Pei Yang and Hai Ci and Yiren Song and Mike Zheng Shou},
  year={2025},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
}