A Tale of Two Worlds: Enabling Whole-Body Manipulation
Through Consistent Visual Guidance
Watch DualWorld enable humanoid robots to perform complex manipulation tasks in diverse real-world scenarios
Kitchen Scene
Spray Bottle
Table Interaction
These predictions demonstrate visual consistency across the full manipulation horizon. Objects maintain coherent appearance, motions follow physically plausible trajectories, and spatial relationships remain stable, providing reliable guidance for whole-body control throughout the sequence.
Wiping Motion
Spray Action
Trash Disposal
V-JEPA translates the world model's consistent visual predictions into action-relevant features at 30Hz. These 1280D features capture how to manipulate objects (affordances), where the body should move (trajectories), and how to handle occlusions—enabling precise whole-body control that follows the visual guidance.
Current Vision-Language-Action (VLA) models face two major bottlenecks: scarce action-labeled data and reliance on manual task segmentation, which limit scalability and policy transfer. While world models trained on internet-scale video data offer a promising alternative, they often lack semantic structure in their representations, and the small diffusion backbones used for real-time control weaken physical reasoning. To overcome these limitations, we introduce DualWorld, a dual-world model system that operates asynchronously in visual space: a “slow” planning model performs full-frame rollouts over ~6-second horizons using a medium-sized video diffusion model, preserving strong generative priors for long-term reasoning, while a “fast” reactive model encodes both predicted and real-time visual observations into actionable representations using a state-of-the-art video encoder, enabling real-time control. This design allows learning from large-scale, diverse-quality datasets, including human videos and cross-embodiment robot trajectories, without manual annotation or curation. Across manipulation and loco-manipulation benchmarks, DualWorld matches or surpasses leading VLA and compact world-model policies in success rate and generalization, demonstrating a viable path toward practical embodied AI.
Generates 33-frame coherent video predictions (6.6 seconds) that maintain visual consistency across the entire manipulation horizon. This continuous visual "script" guides the robot through complex whole-body movements.
Extracts action-conditioned features at 30Hz from the world model's predictions. Pre-trained on Ego4D to understand motion dynamics, it translates consistent visual guidance into precise motor commands.
Key Innovation: The world model provides consistent visual guidance throughout the entire manipulation sequence, while the fast encoder ensures real-time responsiveness (30Hz). This dual-system design enables humanoid robots to execute complex whole-body movements with both long-horizon coherence and immediate reactivity.
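To see why the two timescales are compatible, here is a quick back-of-the-envelope check in Python using only the numbers quoted on this page; the assumption that each of the 4 reuses drives one full 30-step chunk is ours, not stated here.

```python
# Timing sanity check for the dual-system design, using the numbers quoted on this page.
# Assumption (ours, not stated here): each of the 4 reuses drives one full 30-step chunk.

frames_per_prediction = 33
horizon_s = 6.6
prediction_fps = frames_per_prediction / horizon_s      # 5.0 predicted frames per second

control_hz = 30
chunk_steps = 30
chunk_duration_s = chunk_steps / control_hz             # 1.0 s of control per action chunk

prediction_latency_s = 0.9                              # ~900 ms Wan2.2 rollout
reuse_factor = 4
control_covered_s = reuse_factor * chunk_duration_s     # 4.0 s of control per prediction

# One prediction keeps the fast loop busy ~4x longer than it takes to generate the next
# one, so the slow loop can refill its queue before the current prediction runs out.
print(f"prediction rate: {prediction_fps:.1f} fps")
print(f"covered per prediction: {control_covered_s:.1f} s vs. latency: {prediction_latency_s:.1f} s")
```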
DualWorld outperforms leading VLA systems including π0.5 and GR00T across multiple benchmarks
Performance Comparison: Success rate across diverse manipulation tasks. DualWorld achieves significant improvements over existing methods.
Modular design: Slow Thinker (Wan2.2) + Fast Reasoner (V-JEPA) + Spatial Forcing (VGGT) + ActionDiT
Complete Architecture: The Slow Thinker (Wan2.2 5B) generates 33-frame predictions in a background thread, while the Fast Reasoner (V-JEPA 600M) extracts action-conditioned features in the main thread. Spatial Forcing aligns these features with a frozen VGGT-1B for implicit 3D understanding.
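For reference, the components and the figures quoted above can be collected into a small configuration sketch; the field names below are ours, not an official DualWorld API.

```python
from dataclasses import dataclass

# Compact summary of the components listed above. Field names are ours,
# not an official DualWorld interface.
@dataclass(frozen=True)
class DualWorldConfig:
    slow_thinker: str = "Wan2.2 (5B)"                   # video diffusion model, 33-frame rollouts
    fast_reasoner: str = "V-JEPA (600M)"                # video encoder, 1280D features at 30Hz
    spatial_forcing_teacher: str = "VGGT-1B (frozen)"   # 3D-aware alignment target
    action_head: str = "ActionDiT (200M)"               # 30-step action-chunk generator

    predicted_frames: int = 33
    horizon_s: float = 6.6
    control_hz: int = 30
    feature_dim: int = 1280
    action_chunk_steps: int = 30

print(DualWorldConfig())
```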
Our asynchronous streaming mechanism: Wan2.2 requires 900ms for video prediction; V-JEPA + ActionDiT generate actions within 40ms; the control loop operates at 30Hz; a 4x prediction-reuse strategy amortizes the computational cost.
Asynchronous Architecture: A background thread generates predictions (~900ms) and populates a FIFO queue (size=2). The main thread consumes them with <1ms latency, achieving 30Hz control. Each prediction is reused 4 times through temporal indexing.
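A minimal, runnable sketch of this producer-consumer pattern is shown below; the model calls are replaced by timed stand-ins, and the slicing used for temporal indexing is our simplification rather than DualWorld's actual implementation.

```python
import queue
import threading
import time

# Sketch of the asynchronous streaming mechanism described above. Model calls are
# replaced by timed stand-ins; names and the temporal-indexing slice are our own.

prediction_queue = queue.Queue(maxsize=2)               # FIFO queue, size=2


def slow_rollout(obs):
    time.sleep(0.9)                                     # ~900 ms Wan2.2 video prediction
    return [f"frame_{i}" for i in range(33)]            # 33 predicted frames


def fast_policy_step(obs, prediction, reuse_idx):
    start = reuse_idx * len(prediction) // 4            # temporal indexing into the rollout
    _guidance = prediction[start:]
    time.sleep(0.04)                                    # ~40 ms V-JEPA + ActionDiT inference
    return "action_chunk"                               # 30-step chunk (placeholder)


def producer(get_obs):
    while True:                                         # background thread
        prediction_queue.put(slow_rollout(get_obs()))   # blocks only when the queue is full


def consumer(get_obs, execute, n_predictions=3):
    for _ in range(n_predictions):                      # main thread
        prediction = prediction_queue.get()             # <1 ms once a prediction is ready
        for reuse_idx in range(4):                      # each prediction reused 4x
            # Executing the 30-step chunk at 30Hz is omitted here for brevity.
            execute(fast_policy_step(get_obs(), prediction, reuse_idx))


if __name__ == "__main__":
    get_obs = lambda: "camera_frame"
    threading.Thread(target=producer, args=(get_obs,), daemon=True).start()
    consumer(get_obs, execute=lambda chunk: None)
```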
Whole-body manipulation requires maintaining visual coherence across the full sequence. Our approach: generate explicit pixel-space predictions that preserve temporal consistency, then extract action-relevant features at high frequency—ensuring the robot "sees" a coherent visual script throughout complex movements.
Wan2.2 (5B params) generates 33-frame predictions covering 6.6 seconds of manipulation. Full VAE decoding to pixel space ensures visual consistency—objects maintain appearance, motions follow physics, spatial relationships stay coherent. This provides a reliable "visual script" for the entire sequence.
V-JEPA processes the current observation plus future predictions (4 key frames) to extract 1280D features that encode how objects should be manipulated. Pre-trained on Ego4D with temporal objectives, it captures affordances, trajectories, and occlusion handling, translating visual guidance into actionable understanding.
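The sketch below illustrates the shapes involved, with a stub standing in for the real V-JEPA encoder (whose exact API is not reproduced here); the evenly spaced key-frame selection is our assumption.

```python
import torch
import torch.nn as nn

# Shapes follow the page: current observation + 4 key predicted frames -> 1280D features.
# VJEPAStub is a placeholder, not the real V-JEPA backbone.

class VJEPAStub(nn.Module):
    def __init__(self, feature_dim: int = 1280):
        super().__init__()
        self.proj = nn.Linear(3 * 224 * 224, feature_dim)   # placeholder for the real backbone

    def forward(self, clip: torch.Tensor) -> torch.Tensor:
        # clip: (B, T, 3, 224, 224) -> per-frame projection, then temporal pooling
        b, t, c, h, w = clip.shape
        per_frame = self.proj(clip.reshape(b * t, -1)).reshape(b, t, -1)
        return per_frame.mean(dim=1)                         # (B, 1280)


def select_key_frames(prediction: torch.Tensor, num_key: int = 4) -> torch.Tensor:
    # prediction: (B, 33, 3, H, W); take evenly spaced frames across the 6.6 s horizon.
    idx = torch.linspace(0, prediction.shape[1] - 1, num_key).long()
    return prediction[:, idx]


encoder = VJEPAStub()
current_obs = torch.rand(1, 1, 3, 224, 224)                  # latest camera frame
predicted = torch.rand(1, 33, 3, 224, 224)                   # world-model rollout
clip = torch.cat([current_obs, select_key_frames(predicted)], dim=1)  # (1, 5, 3, 224, 224)
features = encoder(clip)                                     # (1, 1280) action-conditioned features
```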
Align V-JEPA features with a frozen VGGT-1B (a 3D-aware vision model) via a cosine-similarity loss. This adds implicit depth understanding to the 2D visual guidance, enabling precise whole-body control in 3D space without requiring depth sensors.
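A minimal sketch of such a cosine-similarity alignment loss is shown below; the projection head used to bridge the feature dimensions, and the dimensions themselves, are our assumptions since the page does not specify them.

```python
import torch
import torch.nn.functional as F

# Pull student (V-JEPA) patch features toward features from a frozen teacher (VGGT-1B).
# Feature tensors here are random placeholders; the projection head is our assumption.

def spatial_forcing_loss(student_feats: torch.Tensor,
                         teacher_feats: torch.Tensor,
                         proj: torch.nn.Module) -> torch.Tensor:
    # student_feats: (B, N, D_s) from V-JEPA; teacher_feats: (B, N, D_t) from frozen VGGT.
    aligned = proj(student_feats)                             # (B, N, D_t)
    cos = F.cosine_similarity(aligned, teacher_feats.detach(), dim=-1)
    return (1.0 - cos).mean()                                 # minimizing drives features to align


B, N, D_s, D_t = 2, 256, 1280, 2048                           # illustrative patch-token shapes
proj_head = torch.nn.Linear(D_s, D_t)
student = torch.rand(B, N, D_s, requires_grad=True)
teacher = torch.rand(B, N, D_t)                               # would come from the frozen VGGT-1B
loss = spatial_forcing_loss(student, teacher, proj_head)
loss.backward()                                               # gradients reach only the student/projection
```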
ActionDiT (200M params) generates 30-step action chunks that follow the consistent visual guidance. By combining temporal coherence from world model predictions with spatial precision from VGGT alignment, it produces smooth whole-body movements coordinated across the full manipulation horizon.
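The sketch below shows one way a diffusion-style head could turn the 1280D features into a 30-step action chunk; the denoiser stub, sampling schedule, and action dimensionality are illustrative assumptions, not the actual ActionDiT configuration.

```python
import torch
import torch.nn as nn

# Simplified action-chunk sampler conditioned on 1280D features. DenoiserStub stands in
# for ActionDiT; ACTION_DIM and the crude Euler-style update are illustrative only.

ACTION_DIM = 26      # assumed whole-body action dimensionality (not stated on this page)
CHUNK_STEPS = 30     # 30-step chunk, i.e. 1.0 s of control at 30Hz


class DenoiserStub(nn.Module):
    def __init__(self, feature_dim: int = 1280):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(CHUNK_STEPS * ACTION_DIM + feature_dim + 1, 512),
            nn.GELU(),
            nn.Linear(512, CHUNK_STEPS * ACTION_DIM),
        )

    def forward(self, noisy_chunk, features, t):
        x = torch.cat([noisy_chunk.flatten(1), features, t], dim=-1)
        return self.net(x).view(-1, CHUNK_STEPS, ACTION_DIM)   # predicted noise


@torch.no_grad()
def sample_action_chunk(denoiser, features, num_denoise_steps: int = 10):
    chunk = torch.randn(features.shape[0], CHUNK_STEPS, ACTION_DIM)   # start from noise
    for step in reversed(range(num_denoise_steps)):
        t = torch.full((features.shape[0], 1), step / num_denoise_steps)
        chunk = chunk - denoiser(chunk, features, t) / num_denoise_steps  # crude Euler-style update
    return chunk                                                       # (B, 30, ACTION_DIM)


features = torch.rand(1, 1280)                  # action-conditioned features from V-JEPA
actions = sample_action_chunk(DenoiserStub(), features)
```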
Explicit pixel predictions maintain coherent appearance and motion across the full sequence, providing reliable guidance for complex movements
Asynchronous design decouples slow prediction (900ms) from fast control (30Hz), achieving both long-horizon consistency and immediate reactivity
VGGT alignment adds 3D awareness to temporally-coherent predictions, enabling whole-body coordination in physical space
Prediction reuse (~4x per sequence) ensures smooth transitions and consistent guidance throughout manipulation without gaps