Action Images: End-to-End Policy Learning via Multiview Video Generation
Abstract
World action models that formulate policy learning as multiview video generation use pixel-grounded action images to enable zero-shot policy learning without separate action modules.
World action models (WAMs) have emerged as a promising direction for robot policy learning, as they can leverage powerful video backbones to model future states. However, existing approaches often rely on separate action modules or use action representations that are not pixel-grounded, making it difficult to fully exploit the pretrained knowledge of video models and limiting transfer across viewpoints and environments. In this work, we present Action Images, a unified world action model that formulates policy learning as multiview video generation. Instead of encoding control as low-dimensional tokens, we translate 7-DoF robot actions into interpretable action images: multiview action videos that are grounded in 2D pixels and explicitly track robot-arm motion. This pixel-grounded action representation allows the video backbone itself to act as a zero-shot policy, without a separate policy head or action module. Beyond control, the same unified model supports video-action joint generation, action-conditioned video generation, and action labeling under a shared representation. On RLBench and real-world evaluations, our model achieves the strongest zero-shot success rates and improves video-action joint generation quality over prior video-space world models, suggesting that interpretable action images are a promising route to policy learning.
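To make the pixel-grounding idea concrete, the sketch below shows one plausible way a trajectory of 3D end-effector positions (taken from 7-DoF actions) could be projected into each camera view with standard pinhole projection. This is an illustrative assumption, not the authors' released pipeline; the function name, argument layout, and the rasterization step mentioned afterward are hypothetical.

```python
import numpy as np

def project_action_to_views(ee_positions, intrinsics, extrinsics):
    """Project 3D end-effector positions into each camera view (illustrative sketch).

    ee_positions: (T, 3) end-effector positions in the world frame.
    intrinsics:   dict view_name -> (3, 3) camera intrinsic matrix K.
    extrinsics:   dict view_name -> (4, 4) world-to-camera transform.
    Returns:      dict view_name -> (T, 2) pixel coordinates tracing arm motion.
    """
    # Homogeneous world coordinates, shape (T, 4).
    homo = np.concatenate([ee_positions, np.ones((len(ee_positions), 1))], axis=1)
    tracks = {}
    for view, world_to_cam in extrinsics.items():
        cam_pts = (world_to_cam @ homo.T).T[:, :3]   # points in the camera frame
        pix = (intrinsics[view] @ cam_pts.T).T       # pinhole perspective projection
        tracks[view] = pix[:, :2] / pix[:, 2:3]      # normalize by depth -> pixels
    return tracks
```

Each per-view pixel track would then be rasterized into frames (for example, a marker or short trail per timestep) and stacked over time, yielding multiview action videos that the same video backbone can generate as a policy or be conditioned on.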
Community
The following papers were recommended by the Semantic Scholar API:
- Multi-View Video Diffusion Policy: A 3D Spatio-Temporal-Aware Video Action Model (2026)
- DriveVA: Video Action Models are Zero-Shot Drivers (2026)
- VideoWeaver: Multimodal Multi-View Video-to-Video Transfer for Embodied Agents (2026)
- Scaling World Model for Hierarchical Manipulation Policies (2026)
- UniLACT: Depth-Aware RGB Latent Action Learning for Vision-Language-Action Models (2026)
- MVISTA-4D: View-Consistent 4D World Model with Test-Time Action Inference for Robotic Manipulation (2026)
- VLA-JEPA: Enhancing Vision-Language-Action Model with Latent World Model (2026)