Abstract
Recent progress in robotic world models has leveraged video diffusion transformers to predict future observations conditioned on historical states and actions. While these models can simulate realistic visual outcomes, they often exhibit poor action-following precision, which hinders their utility for downstream robotic learning. In this work, we introduce World-VLA-Loop, a closed-loop framework for the joint refinement of world models and Vision-Language-Action (VLA) policies. We propose a state-aware video world model that functions as a high-fidelity interactive simulator by jointly predicting future observations and reward signals. To enhance reliability, we introduce the SANS dataset, which incorporates near-success trajectories to improve action-outcome alignment within the world model. This framework enables closed-loop reinforcement learning (RL) post-training of VLA policies entirely within a virtual environment. Crucially, our approach facilitates a co-evolving cycle: failure rollouts generated by the VLA policy are iteratively fed back to refine the world model's precision, which in turn improves subsequent RL optimization. Evaluations across simulation and real-world tasks demonstrate that our framework significantly boosts VLA performance with minimal physical interaction, establishing a mutually beneficial relationship between world modeling and policy learning for general-purpose robotics.
Our Method
The World-VLA-Loop framework consists of four phases; a code sketch of the loop is shown after the list.
- Curate a dataset of success and near-success trajectories (SANS), mainly via manual teleoperation; only a few demonstrations are needed.
- Fine-tune the action-conditioned world model on the SANS dataset with joint reward and video supervision.
- Execute VLA policy rollouts within the world model and perform RL (GRPO) optimization.
- Deploy the refined policy in the real world. During deployment, new rollouts yield additional failure and success trajectories that further augment the SANS dataset, which can be used to iteratively improve the world model and the policy.
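As a rough illustration of how these four phases connect, the sketch below lays out the loop in Python. The interfaces (`WorldModel`, `grpo_update`, `deploy_and_collect`, the `sans` dataset object, and all keyword arguments) are hypothetical assumptions for exposition, not a released API.

```python
# Minimal sketch of the World-VLA-Loop closed loop, assuming hypothetical
# interfaces for the world model, policy, and SANS dataset.

def world_vla_loop(sans, world_model, policy, num_cycles=3):
    for _ in range(num_cycles):
        # Phase 2: fine-tune the action-conditioned world model on SANS
        # with joint video and reward supervision.
        world_model.finetune(sans, video_loss=True, reward_loss=True)

        # Phase 3: roll out the VLA policy inside the world model and
        # optimize it with GRPO on the predicted rewards.
        rollouts = [world_model.rollout(policy, task) for task in sans.tasks]
        policy = grpo_update(policy, rollouts)

        # Phase 4: deploy in the real world; newly collected success and
        # failure trajectories augment SANS for the next cycle.
        sans.add(deploy_and_collect(policy))

    return world_model, policy
```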
Full pipeline of our proposed framework.
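Within phase 3, GRPO dispenses with a learned critic: for each task, a group of rollouts is sampled inside the world model, and each rollout's advantage is its reward normalized by the group statistics. The snippet below is a minimal sketch of that normalization only; the reward tensor, group size, and binary rewards are assumptions made for illustration.

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Group-relative advantages for GRPO.

    rewards: shape (G,), terminal rewards predicted by the world model for
    G rollouts of the same task. Each rollout's advantage is its reward
    normalized by the group mean and std, so no value network is needed.
    """
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: 8 world-model rollouts of one task with binary success rewards.
adv = grpo_advantages(torch.tensor([1., 0., 1., 1., 0., 0., 1., 0.]))
```

In a full objective, these normalized advantages are typically combined with a clipped policy-gradient loss and a KL penalty toward the pre-trained policy.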
Results
Success rates are computed across 500 rollouts for the LIBERO suites and 30 physical rollouts for our real-world experiments.
World Model Generation Results
Through iterative updating, our world model generates videos that closely follow the conditioning actions and adhere to physical dynamics. Note: there is a minor temporal mismatch between the ground-truth and generated videos. This arises because we truncate the ground-truth action trajectories to the nearest multiple of our fixed action chunk size rather than using their full original length; for a consistent visual comparison, we then synchronize the frame rates (FPS) of the two videos so they span an identical total duration.
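The two helpers below sketch this alignment step; the chunk size, frame counts, and FPS values are placeholder assumptions rather than the exact settings we use.

```python
# Illustrative helpers for the trajectory truncation and FPS matching
# described above; all default values are assumptions.

def truncate_to_chunk_multiple(actions, chunk_size=8):
    """Keep only the leading whole action chunks of a ground-truth trajectory."""
    usable = (len(actions) // chunk_size) * chunk_size
    return actions[:usable]

def matched_fps(num_frames, ref_num_frames, ref_fps=30.0):
    """Pick a playback FPS so this clip spans the same duration as the
    reference clip played at ref_fps."""
    duration_s = ref_num_frames / ref_fps
    return num_frames / duration_s
```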
Real-World
[Video panels: generated Success and Failure Trajectories, each paired with the corresponding Ground Truth]
LIBERO-Object
[Video panels: generated Success and Failure Trajectories, each paired with the corresponding Ground Truth]
LIBERO-Goal
[Video panels: generated Success and Failure Trajectories, each paired with the corresponding Ground Truth]
LIBERO-Spatial
[Video panels: generated Success and Failure Trajectories, each paired with the corresponding Ground Truth]
World Model Generation in OOD Cases
Our world model still faithfully follows the conditioning action inputs even for random action sequences unseen during training.
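One simple way to probe this is to condition the world model on randomly generated action sequences, as sketched below; the horizon, action dimension, and step scale are assumptions chosen only for illustration.

```python
import numpy as np

def random_action_sequence(horizon=48, action_dim=7, step_scale=0.05, seed=0):
    """Random walk of small delta-pose (plus gripper) actions, clipped to a
    normalized range, used purely as an out-of-distribution conditioning input."""
    rng = np.random.default_rng(seed)
    steps = rng.normal(0.0, step_scale, size=(horizon, action_dim))
    return np.clip(np.cumsum(steps, axis=0), -1.0, 1.0)
```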
RL Fine-tuned VLA Performance
After RL fine-tuning within our world model, the VLA policy produces more accurate grasping actions.
Real-World (Original Speed)
LIBERO-Object
LIBERO-Goal
LIBERO-Spatial
Robustness to OOD Cases
Our model demonstrates robustness to out-of-distribution cases, indicating that the policy retains its original robustness after our RL fine-tuning.
Unseen Lighting Condition
Unseen Distractor
Unseen Distractors
Unseen Distractors with Disturbance