Olaf-World: Orienting Latent Actions for Video World Modeling

Yuxin Jiang1,2, Yuchao Gu1, Ivor W. Tsang2, Mike Zheng Shou1

1Show Lab, National University of Singapore     2CFAR & IHPC, Agency for Science, Technology and Research (A*STAR), Singapore

arXiv 2026

Abstract

Scaling action-controllable world models is limited by the scarcity of action labels. While latent action learning promises to extract control interfaces from unlabeled video, learned latents often fail to transfer across contexts: they entangle scene-specific cues and lack a shared coordinate system. This occurs because standard objectives operate only within each clip, providing no mechanism to align action semantics across contexts. Our key insight is that although actions are unobserved, their semantic effects are observable and can serve as a shared reference. We introduce SeqΔ-REPA, a sequence-level control-effect alignment objective that anchors the integrated latent action to temporal feature differences from a frozen, self-supervised video encoder. Building on this, we present Olaf-World, a pipeline that pretrains action-conditioned video world models from large-scale passive video. Extensive experiments demonstrate that our method learns a more structured latent action space, leading to stronger zero-shot action transfer and more data-efficient adaptation to new control interfaces than state-of-the-art baselines.

Method Overview

Why do latent actions fail to transfer?

Transition-based latent action models (LAMs) often achieve low reconstruction error within each clip, yet the learned latent directions are not comparable across contexts. Two common failure modes are (1) shortcut learning: latents entangle context cues rather than action effects; and (2) cross-context non-identifiability: each context induces its own latent coordinate system. As a result, the same semantic action (e.g., “Forward”) may correspond to different latent directions in different environments, leading to poor action transfer.

SeqΔ-REPA aligns latent actions using effect directions

Our insight: align actions by observable effects

Although actions are unobserved, action-induced changes are visible in video. We compute an effect direction in a frozen self-supervised video representation and use it as a cross-context reference. SeqΔ-REPA, a sequence-level control-effect alignment objective, aligns latent-action trajectories to these effect directions, encouraging a shared latent coordinate system and more consistent action semantics across environments.
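For concreteness, the sketch below shows one way a sequence-level control-effect alignment loss of this form could be written in PyTorch. The cumulative-sum integration of latent actions, the projection head, the tensor shapes, and all names are illustrative assumptions rather than the released implementation.

# Minimal sketch of a sequence-level control-effect alignment loss in the
# spirit of SeqΔ-REPA. Shapes, the projection head, and the integration
# scheme are illustrative assumptions, not the released implementation.
import torch
import torch.nn.functional as F

def seq_delta_repa_loss(latent_actions, frame_feats, proj_head):
    """
    latent_actions: (B, T, d_a)   per-step latent actions from the LAM
    frame_feats:    (B, T+1, d_f) features of frames 0..T from a frozen,
                    self-supervised video encoder (no gradients)
    proj_head:      small MLP mapping d_a -> d_f
    """
    # Integrated latent action over the sequence (cumulative control signal).
    a_int = latent_actions.cumsum(dim=1)                  # (B, T, d_a)

    # Observable effect direction: temporal feature difference relative to
    # the first frame, computed in the frozen representation space.
    with torch.no_grad():
        effect = frame_feats[:, 1:] - frame_feats[:, :1]  # (B, T, d_f)

    # Align the projected integrated action to the effect direction.
    pred = proj_head(a_int)                               # (B, T, d_f)
    cos = F.cosine_similarity(pred, effect, dim=-1)       # (B, T)
    return (1.0 - cos).mean()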


Results (Latent Space Diagnostics)

We validate transferability with two simple diagnostics that probe whether the learned latent-action space forms a shared coordinate system across contexts.

(1) Cross-context linear probing

We train a linear probe on one context (1st-P) and evaluate it on another (3rd-P), where viewpoint and appearance differ. SeqΔ-REPA improves in-context decodability (solid lines) and, more importantly, cross-context transfer (dashed lines), indicating more context-invariant latent actions. In line with the REPA line of work, we also observe a clear early-stage performance boost, i.e., learning latent actions is easier than you think.

Cross-context linear probing results
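A minimal version of this probe can be written with an off-the-shelf linear classifier. In the sketch below, the logistic-regression choice and all variable names are illustrative assumptions, and the latent actions and action labels for both contexts are assumed to be pre-extracted.

# Minimal sketch of the cross-context linear-probe diagnostic: fit a linear
# classifier on latent actions from one context (1st-P) and test it on
# another (3rd-P). Names and the probe choice are illustrative assumptions.
from sklearn.linear_model import LogisticRegression

def cross_context_probe(z_first, y_first, z_third, y_third):
    """
    z_first, z_third: (N, d) latent actions from the two contexts
    y_first, y_third: (N,)   ground-truth action labels (used only for evaluation)
    """
    probe = LogisticRegression(max_iter=1000)
    probe.fit(z_first, y_first)
    in_context = probe.score(z_first, y_first)     # decodability within 1st-P
    cross_context = probe.score(z_third, y_third)  # transfer to 3rd-P
    return in_context, cross_context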

(2) Cross-context action similarity (cosine)

We compute per-action prototypes within each context and visualize the cosine similarity between 1st-P (rows) and 3rd-P (columns, highlighted in gray) prototypes. A well-aligned latent space is diagonal-dominant: each action matches its cross-context counterpart more strongly than other actions. SeqΔ-REPA produces a sharper diagonal, while baselines show broadly high similarity, suggesting entangled or poorly anchored action directions.

Cosine similarity / alignment results
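The prototype-similarity diagnostic reduces to a few lines of NumPy. The sketch below assumes pre-extracted latents and labels; its function and variable names are illustrative.

# Minimal sketch of the cross-context action-similarity diagnostic: build a
# per-action prototype (mean latent) in each context, then compare them with
# cosine similarity. A well-aligned space yields a diagonal-dominant matrix.
import numpy as np

def prototype_similarity(z_first, y_first, z_third, y_third, actions):
    """Return an (A, A) cosine-similarity matrix between 1st-P (rows) and
    3rd-P (columns) action prototypes."""
    def prototypes(z, y):
        protos = np.stack([z[y == a].mean(axis=0) for a in actions])   # (A, d)
        return protos / np.linalg.norm(protos, axis=1, keepdims=True)  # unit norm

    p_first = prototypes(z_first, y_first)
    p_third = prototypes(z_third, y_third)
    # Entry (i, j): similarity between action_i in 1st-P and action_j in 3rd-P.
    return p_first @ p_third.T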

1. Zero-Shot Action Transfer

1.1 Qualitative Comparison

Comparison of action transfer quality between the AdaWorld baseline and our method. Each row shows the action reference video, AdaWorld's result, and our method's result.

Video grid (six examples): Action Reference, AdaWorld result, Olaf-World result.

1.2 Transfer to Different Contexts

Our method can transfer the same source action to multiple different contexts.

Video grid (four examples): Action Reference, transferred result in Context A, transferred result in Context B.

1.3 Transfer Various Actions into One Context

Demonstration of transferring multiple different actions from various source contexts into a single target context. Each pair shows the source action reference and the transferred result in the target context.

Video grid (six pairs, (a)–(f)): source action reference and transferred result in the target context.

1.4 Interesting Cases

As an add-on, we showcase a few interesting and diverse action-sequence transfer examples from Olaf-World beyond movement and camera control, e.g., flying, shooting, combat (melee attacks and special/ultimate moves), and object despawn. Each pair shows the source reference video and its corresponding transferred result side by side.

Video grid (six pairs, (a)–(f)): source reference video and transferred result.

2. World Model Adaptation

2.1 Efficient Adaptation Comparison

We compare DirectAct, AdaWorld, and Olaf-World under different adaptation data budgets: #videos = 0, 1, and 50 (0-shot, ~1 min, and ~2 hours of action-labeled video, respectively). All methods use the same adaptation setup with LoRA rank = 16. Under the ~1 min budget, DirectAct tends to overfit and AdaWorld follows actions less reliably, while Olaf-World remains consistently controllable.
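A minimal sketch of such a LoRA-based adaptation setup, using the peft library, is shown below. Only the rank matches the setting above; the target module names and the remaining hyperparameters are illustrative assumptions rather than the exact configuration used in our experiments.

# Minimal sketch of rank-16 LoRA adaptation for an action-conditioned world
# model. Target modules and other hyperparameters are illustrative assumptions.
from peft import LoraConfig, get_peft_model

def add_lora_adapters(world_model):
    """Wrap a pretrained world model with rank-16 LoRA adapters; only the
    adapter weights are updated during adaptation."""
    lora_config = LoraConfig(
        r=16,                        # LoRA rank shared by all compared methods
        lora_alpha=16,               # assumed scaling factor
        target_modules=["to_q", "to_k", "to_v", "to_out.0"],  # assumed attention projections
        lora_dropout=0.0,
    )
    return get_peft_model(world_model, lora_config)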

2.1.1 First-Person (1ST-P)

Video grid: DirectAct, AdaWorld, and Olaf-World under 0, 1, and 50 adaptation videos.

2.1.2 Third-Person (3RD-P)

Video grid: DirectAct, AdaWorld, and Olaf-World under 0, 1, and 50 adaptation videos.

2.2 More Adaptation Results

Additional results from Olaf-World for the 50-video adaptation setting, evaluating both in-domain performance and generalization to unseen contexts.

2.2.1 First-Person (1ST-P)

2.2.2 Third-Person (3RD-P)

2.2.3 Generalization to Novel Scenes

BibTeX

If you find our work useful, please consider citing:

@misc{jiang2026olafworldorientinglatentactions,
      title={Olaf-World: Orienting Latent Actions for Video World Modeling}, 
      author={Yuxin Jiang and Yuchao Gu and Ivor W. Tsang and Mike Zheng Shou},
      year={2026},
      eprint={2602.10104},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2602.10104}, 
}