
H2R-Grounder: A Paired-Data-Free Paradigm for Translating Human Interaction Videos into Physically Grounded Robot Videos

Hai Ci, Xiaokang Liu, Pei Yang, Yiren Song, Mike Zheng Shou*
Show Lab, National University of Singapore
{cihai03,mike.zheng.shou}@gmail.com
* Corresponding author
Abstract
We present H2R-Grounder, a paired-data-free paradigm that translates third-person human-object interaction videos into frame-aligned robot manipulation videos. H2R-Grounder decomposes an interaction video into a shared representation, H2Rep, consisting of (i) an abstract pose sequence that captures hand/gripper motion and (ii) a background video that encodes scene and object dynamics. Trained solely on unpaired robot videos, our in-context video generation model synthesizes physically grounded robot arms that follow the human's intent while preserving background consistency and realistic contact dynamics. Extensive experiments on DexYCB and diverse Internet HOI videos demonstrate state-of-the-art motion fidelity and cross-domain generalization without any human-robot paired supervision.
Method Overview
Figure: H2R-Grounder method overview.
H2R-Grounder bridges human and robot videos through a shared abstract representation:
  • H2Rep extraction: From HOI videos, we estimate hand pose and inpaint the human out of the scene; from robot videos, we estimate gripper pose and inpaint the robot arm. Both are converted into a unified abstract pose + background form (see the extraction sketch after this list).
  • Paired-data-free training: An in-context video generation model is fine-tuned only on unpaired robot videos to map H2Rep → robot manipulation video (a minimal training/inference sketch appears further below).
  • Physically grounded synthesis: During inference, the model follows pose dynamics, keeps background/object motion consistent, and inserts a geometrically plausible robot arm.
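As a rough illustration of the extraction step, the sketch below shows how both human and robot videos could be reduced to the same abstract form. It is not the released implementation: `pose_estimator`, `inpainter`, and `renderer` are placeholder callables standing in for off-the-shelf hand/gripper pose estimation, video inpainting, and pose-indicator rendering, and the exact H2Rep layout is an assumption.

```python
# Illustrative sketch only (assumed names, not the released implementation).
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

import numpy as np


@dataclass
class H2Rep:
    """Shared abstract representation: per-frame pose indicators + agent-free background."""
    pose_frames: List[np.ndarray]        # rendered abstract pose indicator per frame
    background_frames: List[np.ndarray]  # frames with the human / robot arm removed


def extract_h2rep(
    frames: Sequence[np.ndarray],
    pose_estimator: Callable[[np.ndarray], Tuple[np.ndarray, np.ndarray]],
    inpainter: Callable[[np.ndarray, np.ndarray], np.ndarray],
    renderer: Callable[[np.ndarray, tuple], np.ndarray],
) -> H2Rep:
    """Convert an HOI or robot video into H2Rep.

    For human videos the estimator returns a hand pose and a human mask;
    for robot videos, a gripper pose and an arm mask. Either way the output
    has the same abstract pose + background form.
    """
    pose_frames, background_frames = [], []
    for frame in frames:
        pose, agent_mask = pose_estimator(frame)                # end-effector pose + agent mask
        pose_frames.append(renderer(pose, frame.shape))         # draw abstract pose indicator
        background_frames.append(inpainter(frame, agent_mask))  # remove the acting agent
    return H2Rep(pose_frames, background_frames)
```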
See the paper for details on pose indicators, masking, and training objectives.
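To make the paired-data-free idea concrete, here is a minimal sketch of the training step and the inference-time translation. The interfaces are assumptions, not the authors' code: `video_model` stands in for a conditional in-context video generation backbone with hypothetical `loss` and `sample` methods, and `to_h2rep` converts a video tensor of shape (T, C, H, W) into pose-indicator and background tensors.

```python
# Illustrative sketch only (assumed interfaces, not the authors' code).
import torch


def train_step(video_model, optimizer, robot_video, to_h2rep):
    """One self-supervised step: reconstruct a robot video from its own H2Rep."""
    pose, background = to_h2rep(robot_video)          # abstract pose + agent-free background
    condition = torch.cat([pose, background], dim=1)  # stack along the channel dimension

    optimizer.zero_grad()
    loss = video_model.loss(condition=condition, target=robot_video)
    loss.backward()
    optimizer.step()
    return loss.item()


@torch.no_grad()
def translate_human_video(video_model, human_video, to_h2rep):
    """Inference: condition on H2Rep from a human video to synthesize a
    frame-aligned robot manipulation video."""
    pose, background = to_h2rep(human_video)
    condition = torch.cat([pose, background], dim=1)
    return video_model.sample(condition=condition)
```

Because both domains are reduced to the same H2Rep, the model never needs a human-robot paired video: it is trained purely to map a robot video's own abstraction back to that video, and at inference the human video's abstraction is simply substituted in.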
Qualitative Video Gallery
Use the buttons below to filter the videos by setting.