World-VLA-Loop : Closed-Loop Learning of Video World Model and VLA Policy

Show Lab
National University of Singapore

World-VLA-Loop Teaser

(a) Paradigms for world-model-based VLA reinforcement learning. Existing approaches typically either reconstruct the environment as a 3D world or train video world models that simulate it. To address the imprecise action-following inherent in existing video-based simulators, we propose World-VLA-Loop, a closed-loop paradigm that jointly optimizes the world model and the VLA policy to iteratively enhance the performance and grounding of both. (b) The real-world policy success rate improves by 36.7% after two iterations of joint optimization of the VLA model and the world model.

Abstract

Recent progress in robotic world models has leveraged video diffusion transformers to predict future observations conditioned on historical states and actions. While these models can simulate realistic visual outcomes, they often exhibit poor action-following precision, hindering their utility for downstream robotic learning. In this work, we introduce World-VLA-Loop, a closed-loop framework for the joint refinement of world models and Vision-Language-Action (VLA) policies. We propose a state-aware video world model that functions as a high-fidelity interactive simulator by jointly predicting future observations and reward signals. To enhance reliability, we introduce the SANS dataset, which incorporates near-success trajectories to improve action-outcome alignment within the world model. This framework enables closed-loop reinforcement learning (RL) post-training of VLA policies entirely within a virtual environment. Crucially, our approach facilitates a co-evolving cycle: failure-trajectory rollouts generated by the VLA policy are iteratively fed back to refine the world model's precision, which in turn enhances subsequent RL optimization. Evaluations across simulation and real-world tasks demonstrate that our framework significantly boosts VLA performance with minimal physical interaction, establishing a mutually beneficial relationship between world modeling and policy learning for general-purpose robotics.
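
To make the simulator role concrete, the sketch below shows one way such a state-aware world model could be exposed to a policy: given the observation history and an action chunk, it returns predicted frames together with a reward. This is a minimal illustration, not the released implementation; the WorldModelSim class and its predict interface are assumptions.

# Minimal sketch, assuming a hypothetical action-conditioned video model that
# exposes predict(history, action_chunk) -> (frames, reward); illustration only.
from dataclasses import dataclass, field
from typing import Any, List, Tuple
import numpy as np


@dataclass
class WorldModelSim:
    model: Any                                   # assumed video world model
    history: List[np.ndarray] = field(default_factory=list)

    def reset(self, initial_frames: List[np.ndarray]) -> np.ndarray:
        # Seed the rollout with real observation frames.
        self.history = list(initial_frames)
        return self.history[-1]

    def step(self, action_chunk: np.ndarray) -> Tuple[np.ndarray, float]:
        # Jointly predict future frames and a reward for one action chunk,
        # then append the frames to the rollout history.
        frames, reward = self.model.predict(self.history, action_chunk)
        self.history.extend(frames)
        return frames[-1], float(reward)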

Our Method

World-VLA-Loop framework consists of four phases.

  1. Curate a success and near-success dataset (SANS), mainly via manual teleoperation; only a few demonstrations are needed.
  2. Fine-tune the action-conditioned world model on the SANS dataset with joint reward and video supervision.
  3. Execute VLA policy rollouts within the world model and perform RL (GRPO) optimization (see the sketch after this list).
  4. Deploy the refined policy in the real world. During deployment, new rollouts can be collected as additional failure and success trajectories to augment the SANS dataset, which is then used to iteratively improve both the world model and the policy.
This cycle enables joint optimization of the world model and the VLA policy, iteratively enhancing the performance of both.
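
As a concrete anchor for phase 3, the snippet below sketches the group-relative advantage at the core of a GRPO-style update, assuming each task instruction is rolled out several times inside the world model and each rollout is scored by the model's predicted reward. Names and reward values are illustrative, not taken from our released code.

# Group-relative advantages for GRPO-style post-training (illustrative sketch).
import numpy as np


def grpo_advantages(group_rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    # Normalize rewards within one group of rollouts that share the same
    # task instruction and initial observation.
    mean, std = group_rewards.mean(), group_rewards.std()
    return (group_rewards - mean) / (std + eps)


# Example: four world-model rollouts of the same instruction, scored by the
# world model's predicted reward (illustrative values).
rewards = np.array([0.0, 1.0, 1.0, 0.2])
print(grpo_advantages(rewards))  # higher-reward rollouts receive positive advantages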


Full pipeline of our proposed framework.

Results

Success rates are computed across 500 rollouts for the LIBERO suites and 30 physical rollouts for our real-world experiments.

Table: Success rates on the LIBERO suites and real-world tasks.
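
For context on these sample sizes, the short sketch below (with purely illustrative success counts) shows how a success rate and its binomial standard error scale with the number of rollouts; 500 simulated rollouts yield a much tighter estimate than 30 physical ones.

# Success rate with a binomial standard error, assuming independent rollouts.
import math


def success_rate_with_se(n_success: int, n_rollouts: int):
    p = n_success / n_rollouts
    se = math.sqrt(p * (1 - p) / n_rollouts)  # binomial standard error
    return p, se


# Illustrative counts only, not our reported results.
print(success_rate_with_se(450, 500))  # (0.90, ~0.013)
print(success_rate_with_se(27, 30))    # (0.90, ~0.055)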

World Model Generation Results

Through iterative updating, our world model generates videos that are closely aligned with the conditioning actions and adhere to physical dynamics. Note: there is a minor temporal mismatch between the ground-truth and generated videos. This discrepancy arises because we truncate the ground-truth action trajectories to the nearest multiple of our fixed action chunk size rather than using their full original length. To ensure a consistent visual comparison, we synchronize the frame rates (FPS) of both videos so that they have the same total duration.
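
The sketch below illustrates this truncation-and-synchronization arithmetic with a hypothetical action chunk size and trajectory length.

# Truncate the action trajectory to a multiple of the chunk size, then pick an
# FPS so both clips span the same duration (hypothetical numbers).
def truncate_to_chunks(n_actions: int, chunk_size: int) -> int:
    # Keep only the largest multiple of the action chunk size.
    return (n_actions // chunk_size) * chunk_size


def synced_fps(n_frames: int, target_duration_s: float) -> float:
    # Choose an FPS so that a clip with n_frames spans the target duration.
    return n_frames / target_duration_s


chunk_size = 8
kept = truncate_to_chunks(77, chunk_size)   # 72 actions are actually conditioned on
duration_s = 3.0                            # shared display duration in seconds
print(kept, synced_fps(kept, duration_s))   # ground-truth and generated clips now match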

World Model Generation in OOD Cases

Our world model still faithfully follows the conditioning actions, even for random action sequences unseen during training.
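
One simple way to reproduce this probe is to sample random action chunks from a range the model never saw during training and feed them to the simulator interface sketched earlier; the dimensions and ranges below are illustrative assumptions.

# Random, out-of-distribution action chunks for probing action-following.
import numpy as np

rng = np.random.default_rng(0)
action_dim, chunk_size, n_chunks = 7, 8, 4   # e.g. a 7-DoF end-effector command


def random_action_chunks():
    # Uniform actions in a normalized range, unseen during training.
    return [rng.uniform(-1.0, 1.0, size=(chunk_size, action_dim))
            for _ in range(n_chunks)]


# for chunk in random_action_chunks():
#     frame, reward = sim.step(chunk)  # `sim` is a WorldModelSim instance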

RL Fine-tuned VLA Performance

After RL fine-tuning within our world model, the VLA policy produces more accurate grasping.

Robustness to OOD Cases

Our model demonstrates robustness to out-of-distribution cases, indicating that the policy retains its original robustness after our RL fine-tuning.

Unseen Lighting Condition

Unseen Distractor

Unseen Distractors

Unseen Distractors with Disturbance
