We present H2R-Grounder, a paired-data-free paradigm that translates third-person human-object interaction videos into
frame-aligned robot manipulation videos. H2R-Grounder decomposes interaction videos into a shared representation H2Rep of
(i) an abstract pose sequence capturing hand/gripper motion and (ii) a background video encoding scene and object dynamics.
Trained solely on unpaired robot videos, our in-context video generation model produces physically grounded robot arms
that follow human intent while preserving background consistency and realistic contact dynamics.
Extensive experiments on DexYCB and diverse Internet HOI videos demonstrate state-of-the-art motion fidelity and
cross-domain generalization without requiring any human-robot paired supervision.
Method Overview
H2R-Grounder bridges human and robot videos through a shared abstract representation:
H2Rep extraction: From HOI videos, we estimate hand pose and remove the human; from robot videos,
we estimate gripper pose and inpaint the robot arm. Both are converted into a unified abstract pose + background form.
Paired-data-free training: An in-context video generation model is fine-tuned only on unpaired robot videos to
map H2Rep → robot manipulation video.
Physically grounded synthesis: During inference, the model follows pose dynamics, keeps background/object motion
consistent, and inserts a geometrically plausible robot arm.
See paper for details on pose indicators, masking, and training objectives.
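The decomposition above can be sketched in code. This is a minimal illustration, not the paper's actual implementation: the `H2Rep` container, the function name `extract_h2rep`, the array shapes, and the mean-color fill (standing in for a real video inpainting model) are all assumptions for exposition.

```python
# Illustrative sketch of the H2Rep container and extraction step.
# Function names, shapes, and the fill strategy are assumptions;
# the paper's pipeline uses learned pose estimation and inpainting.
from dataclasses import dataclass
import numpy as np

@dataclass
class H2Rep:
    """Shared abstract representation: pose track + agent-removed background."""
    pose: np.ndarray        # (T, K, 2) 2-D keypoints of the hand or gripper
    background: np.ndarray  # (T, H, W, 3) video with the acting agent removed

def extract_h2rep(video: np.ndarray, agent_mask: np.ndarray,
                  keypoints: np.ndarray) -> H2Rep:
    """Remove the acting agent (human hand or robot arm), keep its pose.

    video:      (T, H, W, 3) uint8 frames
    agent_mask: (T, H, W) boolean mask of the agent in each frame
    keypoints:  (T, K, 2) estimated hand/gripper pose per frame
    """
    background = video.copy()
    for t in range(video.shape[0]):
        # Placeholder "inpainting": fill agent pixels with the per-frame
        # mean color of the non-agent region.
        fill = video[t][~agent_mask[t]].mean(axis=0).astype(video.dtype)
        background[t][agent_mask[t]] = fill
    return H2Rep(pose=keypoints, background=background)
```

Applied to a human video this yields the (pose, background) pair fed to the in-context generation model; applied to robot videos during training, the same form lets the model learn the H2Rep-to-robot-video mapping without any paired human-robot data.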
Qualitative Video Gallery
Use the buttons below to filter the videos by setting.
Human Video (input)
H2Rep (our pose)
Robot Video (ours)
Ours (DexYCB)
Ours · DexYCB
Human Video (input)
H2Rep (our pose)
Robot Video (ours)
Ours (DexYCB) - Incorrect Pose
Ours · DexYCB
The model shows a certain tolerance to incorrect pose sequences: it can still generate appropriate arm motion by following the object motion encoded in H2Rep.
Human Video (input)
H2Rep (our pose)
Robot Video (ours)
Ours (Internet)
Ours · Internet
Human Video (input)
H2Rep (our pose)
Robot Video (ours)
Ours (Badcase)
Ours · Badcase
Visual artifacts in the generated robot video.
Human Video (input)
H2Rep (our pose)
Robot Video (ours)
Ours (Badcase)
Ours · Badcase
Scale inconsistencies in wide, open scenes.
Human Video (input)
H2Rep (our pose)
Robot Video (ours)
Ours (Badcase)
Ours · Badcase
Imprecise grasping of thin objects.
Human Video (input)
H2Rep (our pose)
Robot Video (ours)
Ours (Badcase)
Ours · Badcase
Generation instability in first-person scenarios. This could be addressed by fine-tuning on a first-person dataset; the model is currently trained only on the third-person DROID dataset.