1Show Lab, National University of Singapore 2CFAR & IHPC, Agency for Science, Technology and Research (A*STAR), Singapore
arXiv 2026
Scaling action-controllable world models is limited by the scarcity of action labels. While latent action learning promises to extract control interfaces from unlabeled video, learned latents often fail to transfer across contexts: they entangle scene-specific cues and lack a shared coordinate system. This occurs because standard objectives operate only within each clip, providing no mechanism to align action semantics across contexts. Our key insight is that although actions are unobserved, their semantic effects are observable and can serve as a shared reference. We introduce SeqΔ-REPA, a sequence-level control-effect alignment objective that anchors integrated latent actions to temporal feature differences from a frozen, self-supervised video encoder. Building on this, we present Olaf-World, a pipeline that pretrains action-conditioned video world models from large-scale passive video. Extensive experiments demonstrate that our method learns a more structured latent action space, leading to stronger zero-shot action transfer and more data-efficient adaptation to new control interfaces than state-of-the-art baselines.
Transition-based latent action models (LAMs) often achieve low reconstruction error within each clip, yet the latent directions are not comparable across contexts. Two common failure modes are: (1) shortcut learning, where latents entangle context cues rather than action effects; and (2) cross-context non-identifiability, where each context induces its own latent coordinate system. As a result, the same semantic action (e.g., "Forward") may correspond to different latent directions in different environments, leading to poor action transfer.
Although actions are unobserved, action-induced changes are visible in video. We compute an effect direction in a frozen self-supervised video representation and use it as a cross-context reference. SeqΔ-REPA, a sequence-level control-effect alignment objective, aligns latent-action trajectories to these effect directions, encouraging a shared latent coordinate system and more consistent action semantics across environments.
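The alignment idea above can be sketched as a loss function. This is a minimal illustrative sketch, not the paper's implementation: it assumes per-transition latent actions already projected to the frozen encoder's feature dimension (the projection head is omitted), integrates them by summation over each window, and penalizes cosine misalignment with the observed effect direction. Function and variable names are hypothetical.

```python
import numpy as np

def seq_delta_repa_loss(latent_actions, frame_feats):
    """Sequence-level control-effect alignment (illustrative sketch).

    latent_actions: (T-1, d) latent action per transition, assumed already
                    projected into the frozen encoder's feature space.
    frame_feats:    (T, d) frozen self-supervised features, one per frame.

    For each window [0, t], the integrated latent action (sum of per-step
    latents) is aligned with the observable effect direction f_t - f_0
    via cosine similarity.
    """
    T = frame_feats.shape[0]
    loss = 0.0
    integrated = np.zeros_like(latent_actions[0])
    for t in range(1, T):
        integrated = integrated + latent_actions[t - 1]  # integrate latents over the window
        effect = frame_feats[t] - frame_feats[0]         # observable effect direction
        cos = integrated @ effect / (
            np.linalg.norm(integrated) * np.linalg.norm(effect) + 1e-8
        )
        loss += 1.0 - cos                                # alignment penalty for this window
    return loss / (T - 1)
```

When the latent actions exactly track the per-frame feature differences, every integrated latent matches its effect direction and the loss vanishes; latents pointing the opposite way are maximally penalized.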
We validate transferability with two simple diagnostics that probe whether the learned latent-action space forms a shared coordinate system across contexts.
We train a linear probe on one context (1st-P) and evaluate on another (3rd-P), where viewpoint and appearance differ. SeqΔ-REPA improves in-context decodability (solid lines) and, more importantly, cross-context transfer (dashed lines), indicating more context-invariant latent actions. In line with the REPA line of work, we also observe a clear early-stage performance boost: learning latent actions is easier than one might think.
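This diagnostic can be reproduced in a few lines. The sketch below is an assumption-laden stand-in (names and the least-squares probe are our own choices, not necessarily the paper's probe): it fits a linear probe on latents from one context and reports accuracy on latents from another.

```python
import numpy as np

def cross_context_probe_acc(z_src, y_src, z_tgt, y_tgt):
    """Fit a least-squares linear probe on source-context latents (z_src, y_src)
    and report classification accuracy on target-context latents (z_tgt, y_tgt).
    High target accuracy indicates context-invariant latent actions (sketch)."""
    n_cls = int(max(y_src.max(), y_tgt.max())) + 1
    X = np.hstack([z_src, np.ones((len(z_src), 1))])  # append bias column
    Y = np.eye(n_cls)[y_src]                          # one-hot action labels
    W, *_ = np.linalg.lstsq(X, Y, rcond=None)         # closed-form linear probe
    Xt = np.hstack([z_tgt, np.ones((len(z_tgt), 1))])
    pred = (Xt @ W).argmax(axis=1)
    return float((pred == y_tgt).mean())
```

A probe that transfers well across contexts is evidence that the latent action space uses one shared coordinate system rather than a per-context one.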
We compute per-action prototypes within each context and visualize the cosine similarity between 1st-P (rows) and 3rd-P (columns, highlighted in gray) prototypes. A well-aligned latent space is diagonal-dominant, meaning each action matches its cross-context counterpart more than other actions. SeqΔ-REPA produces a sharper diagonal, while baselines show broadly high similarity, suggesting entangled or poorly anchored action directions.
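The prototype-similarity matrix above can be computed as follows. This is a generic sketch under our own conventions (function names are hypothetical): prototypes are per-action means of the latents, L2-normalized so that the matrix entries are cosine similarities.

```python
import numpy as np

def prototype_similarity(z_a, y_a, z_b, y_b, n_actions):
    """Cosine-similarity matrix between per-action latent prototypes of two
    contexts. Rows index context A prototypes, columns context B. A dominant
    diagonal means each action best matches its cross-context counterpart."""
    def protos(z, y):
        P = np.stack([z[y == k].mean(axis=0) for k in range(n_actions)])
        return P / np.linalg.norm(P, axis=1, keepdims=True)  # unit-normalize rows
    return protos(z_a, y_a) @ protos(z_b, y_b).T
```

Diagonal dominance (diagonal entries clearly larger than off-diagonal ones) is the quantitative signature of an aligned latent action space.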
Comparison of action transfer quality between AdaWorld baseline and our method. Each row shows the action reference video, AdaWorld's result, and our method's result.
Our method can transfer the same source action to multiple different contexts.
Demonstration of transferring multiple different actions from various source contexts into a single target context. Each pair shows the source action reference and the transferred result in the target context.
As an add-on, we showcase a few interesting and diverse action-sequence transfer examples from Olaf-World, beyond movement and camera control, e.g., flying, shooting, combat (melee attacks + special/ultimate moves), and object despawn. Each pair shows the source reference video and its corresponding transferred result side by side.
We compare DirectAct, AdaWorld, and Olaf-World under different adaptation data budgets: #videos = 0, 1, 50 (0-shot, ~1 min, and ~2 hours of action-labeled video, respectively). All methods use the same adaptation setup with LoRA rank = 16. Under the ~1 min budget, DirectAct tends to overfit and AdaWorld follows actions less reliably, while Olaf-World remains consistently controllable.
Additional results from Olaf-World for the 50-video adaptation setting, evaluating both in-domain performance and generalization to unseen contexts.
If you find our work useful, please consider citing:
@misc{jiang2026olafworldorientinglatentactions,
title={Olaf-World: Orienting Latent Actions for Video World Modeling},
author={Yuxin Jiang and Yuchao Gu and Ivor W. Tsang and Mike Zheng Shou},
year={2026},
eprint={2602.10104},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2602.10104},
}