DUAL-SYSTEM WORLD MODELS FOR EMBODIED AI

DualWorld

A Tale of Two Worlds: Enabling Whole-Body Manipulation
Through Consistent Visual Guidance

⚡ Fast Reasoner: V-JEPA (600M) @ 30Hz
+
🧠 Slow Thinker: Wan2.2 (5B) @ 1Hz

Whole-Body Manipulation in Action

See how DualWorld enables humanoid robots to perform complex manipulation tasks in diverse real-world scenarios

Real-World Whole-Body Manipulation


Temporally-Consistent Visual Predictions (32 frames, 6.4s)

Prediction scenes: Kitchen Scene, Spray Bottle, Table Interaction

These predictions demonstrate visual consistency across the full manipulation horizon. Objects maintain coherent appearance, motions follow physically-plausible trajectories, and spatial relationships remain stable—providing reliable guidance for whole-body control throughout the sequence.

From Visual Guidance to Motor Commands

Feature-to-action pairs: V-JEPA Wipe Features → Wiping Motion, V-JEPA Spray Features → Spray Action, V-JEPA Trash Features → Trash Disposal

V-JEPA translates the world model's consistent visual predictions into action-relevant features at 30Hz. These 1280D features capture how to manipulate objects (affordances), where the body should move (trajectories), and how to handle occlusions—enabling precise whole-body control that follows the visual guidance.

Consistent Visual Guidance for Whole-Body Control

Current Vision-Language-Action (VLA) models face two major bottlenecks: scarce action-labeled data and reliance on manual task segmentation, which limit scalability and policy transfer. While world models trained on internet-scale video data offer a promising alternative, they often lack semantic structure in their representations, and the small diffusion backbones used for real-time control weaken physical reasoning. To overcome these limitations, we introduce DualWorld, a dual-world model system that operates asynchronously in visual space. A “slow” planning model performs full-frame rollouts over ~6-second horizons using a medium-sized video diffusion model, preserving strong generative priors for long-term reasoning. A “fast” reactive model encodes both predicted and real-time visual observations into actionable representations using a state-of-the-art video encoder, enabling real-time control. This design allows learning from large-scale, diverse-quality datasets—including human videos and cross-embodiment robot trajectories—without manual annotation or curation. Across manipulation and loco-manipulation benchmarks, DualWorld matches or surpasses leading VLA and compact world-model policies in success rate and generalization, demonstrating a viable path toward practical embodied AI.

🧠 Slow Thinker: Long-Horizon Visual Guidance

Wan2.2 5B - World Model for Consistency

Generates 33-frame coherent video predictions (6.6 seconds) that maintain visual consistency across the entire manipulation horizon. This continuous visual "script" guides the robot through complex whole-body movements.

  • Internet-scale physics priors (5B parameters)
  • Explicit pixel-space predictions for interpretability
  • Maintains spatial-temporal coherence across full sequence

⚡ Fast Reasoner: Real-Time Action Extraction

V-JEPA 600M - Motion-Aware Encoder

Extracts action-conditioned features at 30Hz from the world model's predictions. Pre-trained on Ego4D to understand motion dynamics, it translates consistent visual guidance into precise motor commands.

  • Processes current observation + future predictions
  • ~40ms per action chunk generation
  • Captures affordances, trajectories, occlusions

Key Innovation: The world model provides consistent visual guidance throughout the entire manipulation sequence, while the fast encoder ensures real-time responsiveness (30Hz). This dual-system design enables humanoid robots to execute complex whole-body movements with both long-horizon coherence and immediate reactivity.

State-of-the-Art Performance

DualWorld outperforms leading VLA systems including π0.5 and GR00T across multiple benchmarks

Performance Comparison: Success rate across diverse manipulation tasks. DualWorld achieves significant improvements over existing methods.

  • 900ms: Wan2.2 prediction generation
  • 40ms: action chunk generation
  • 4x: prediction reuse per rollout
  • 30Hz: control frequency
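
These numbers fit together into a comfortable real-time budget. The back-of-envelope check below assumes that each 30-step action chunk is executed at the 30Hz control rate (about 1 second of motion per chunk) and that each Wan2.2 rollout feeds four consecutive chunks; both assumptions are ours, made for illustration.

```python
# Back-of-envelope timing budget (illustrative assumptions, see text above).
chunk_seconds = 30 / 30                        # 30-step chunk at 30Hz -> 1.0 s of control
control_per_rollout = 4 * chunk_seconds        # 4x reuse -> 4.0 s served per prediction
slow_utilization = 0.9 / control_per_rollout   # 900ms generation -> ~22% of the background thread
fast_utilization = 0.04 / chunk_seconds        # 40ms chunk generation -> ~4% of the control thread
assert 6.6 > control_per_rollout               # the 6.6 s horizon outlasts the 4 s each rollout must cover
print(f"slow: {slow_utilization:.0%}, fast: {fast_utilization:.0%}")  # roughly "slow: 22%, fast: 4%"
```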

Dual-System Design

Modular design: Slow Thinker (Wan2.2) + Fast Reasoner (V-JEPA) + Spatial Forcing (VGGT) + ActionDiT

DualWorld Architecture Diagram

Complete Architecture: The Slow Thinker (Wan2.2 5B) generates 33-frame predictions in a background thread, while the Fast Reasoner (V-JEPA 600M) extracts action-conditioned features in the main thread. Spatial Forcing aligns these features with a frozen VGGT-1B for implicit 3D understanding.
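
For orientation, here is a minimal configuration sketch of the four modules named above; the class and field names are our own illustrative choices, with sizes and rates taken from the text.

```python
from dataclasses import dataclass

@dataclass
class DualWorldConfig:
    """Illustrative summary of the modular design (field names are assumptions)."""
    slow_thinker: str = "Wan2.2 (5B)"      # video diffusion world model, background thread
    fast_reasoner: str = "V-JEPA (600M)"   # motion-aware video encoder, main thread
    spatial_forcing: str = "VGGT-1B"       # frozen 3D-aware model, used only as an alignment target
    action_head: str = "ActionDiT (200M)"  # diffusion-transformer action head
    prediction_frames: int = 33            # ~6.6 s of predicted video per rollout
    action_chunk_steps: int = 30           # action steps per chunk, executed at 30Hz
    prediction_queue_size: int = 2         # FIFO queue of pending predictions
    prediction_reuse: int = 4              # action chunks served per prediction
```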

Training & Inference Pipeline

Our asynchronous streaming mechanism: Wan2.2 requires ~900ms per video prediction; V-JEPA + ActionDiT generate an action chunk within 40ms; the control loop runs at 30Hz; and a 4x prediction-reuse strategy amortizes the cost of video generation.

Prediction Reuse & Training Pipeline

Asynchronous Architecture: A background thread generates predictions (~900ms each) and populates a FIFO queue (size=2). The main thread consumes them with <1ms latency, sustaining 30Hz control. Each prediction is reused 4 times through temporal indexing.
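
A minimal sketch of this producer/consumer scheme is shown below. The model calls are placeholder stubs that only simulate the reported latencies, and the exact scheduling (one 30-step chunk generated per ~1 s of execution, four chunks per rollout) is our reading of the pipeline rather than the released implementation.

```python
import queue
import threading
import time

# Placeholder stubs that only simulate the reported latencies; the real models
# and their interfaces are not shown here.
def slow_rollout(obs):
    """Slow Thinker (Wan2.2-style video model): ~900ms full-frame rollout."""
    time.sleep(0.9)
    return {"frames": [obs] * 33}                 # 33 predicted frames (~6.6 s)

def fast_action_chunk(obs, prediction, reuse_idx):
    """Fast Reasoner (V-JEPA-style encoder) + ActionDiT: ~40ms per chunk."""
    time.sleep(0.04)
    return [[0.0] * 7 for _ in range(30)]         # 30-step action chunk

pred_queue = queue.Queue(maxsize=2)               # FIFO queue of predictions, size 2
latest_obs = {"image": None}                      # updated elsewhere by the camera driver
REUSE = 4                                         # action chunks served per prediction

def slow_thread():
    """Background thread: keep the queue topped up with fresh rollouts."""
    while True:
        pred_queue.put(slow_rollout(latest_obs["image"]))   # blocks while the queue is full

def control_loop(hz=30):
    """Main thread: dequeue a prediction (<1ms when one is waiting), reuse it
    for four consecutive chunks via its temporal index, and emit actions at `hz`."""
    pred, reuse_idx = pred_queue.get(), 0
    while True:
        if reuse_idx == REUSE:
            pred, reuse_idx = pred_queue.get(), 0
        chunk = fast_action_chunk(latest_obs["image"], pred, reuse_idx)
        reuse_idx += 1
        for action in chunk:                      # play the chunk out at 30Hz
            # send `action` to the low-level whole-body controller here
            time.sleep(1.0 / hz)

threading.Thread(target=slow_thread, daemon=True).start()
# control_loop()  # runs indefinitely; call from the robot's main process
```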

How It Works

Building Consistent Visual Guidance

Whole-body manipulation requires maintaining visual coherence across the full sequence. Our approach: generate explicit pixel-space predictions that preserve temporal consistency, then extract action-relevant features at high frequency—ensuring the robot "sees" a coherent visual script throughout complex movements.

1. Generate Coherent Visual Predictions

Wan2.2 (5B params) generates 33-frame predictions covering 6.6 seconds of manipulation. Full VAE decoding to pixel space ensures visual consistency—objects maintain appearance, motions follow physics, spatial relationships stay coherent. This provides a reliable "visual script" for the entire sequence.
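
As a rough sketch of this step, the function below denoises a latent video rollout conditioned on the current frame and a task prompt, then decodes it to pixel space with a VAE. The `denoiser` and `vae` callables, the latent shapes, and the sampler are illustrative assumptions, not the Wan2.2 internals.

```python
import torch

@torch.no_grad()
def predict_visual_script(denoiser, vae, current_frame, prompt_emb,
                          num_frames=33, steps=20):
    """Illustrative latent-video rollout + full VAE decode (not the Wan2.2 API)."""
    # Start from Gaussian noise in a (1, T, C_lat, h, w) latent video.
    latents = torch.randn(1, num_frames, 16, 32, 32, device=current_frame.device)
    for t in reversed(range(steps)):
        # One denoising step conditioned on the observed frame and the prompt.
        latents = denoiser(latents, timestep=t,
                           cond_frame=current_frame, cond_text=prompt_emb)
    # Full VAE decode back to pixel space: (1, T, 3, H, W), ~6.6 s of video
    # that serves as the explicit "visual script" for the whole sequence.
    return vae.decode(latents)
```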

2. Extract Motion-Aware Features

V-JEPA processes current observation + future predictions (4 key frames) to extract 1280D features that understand how objects should be manipulated. Pre-trained on Ego4D with temporal objectives, it captures affordances, trajectories, and occlusion handling—translating visual guidance into actionable understanding.
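
A minimal sketch of this step, assuming the key frames are sampled uniformly from the predicted clip and that `encoder` stands in for the 600M V-JEPA backbone; the exact frame schedule and feature layout are assumptions.

```python
import torch

def extract_guidance_features(encoder, current_frame, predicted_frames):
    """Stack the current observation with 4 key frames and encode the clip.
    `encoder` is a stand-in for V-JEPA; it returns ~1280-D feature tokens."""
    # predicted_frames: (33, 3, H, W); current_frame: (3, H, W)
    key_idx = torch.linspace(0, predicted_frames.shape[0] - 1, steps=4).long()
    key_frames = predicted_frames[key_idx]                              # (4, 3, H, W)
    clip = torch.cat([current_frame.unsqueeze(0), key_frames], dim=0)   # (5, 3, H, W)
    # A video encoder typically returns per-token features, e.g. (1, N, 1280),
    # which the action head consumes as conditioning.
    return encoder(clip.unsqueeze(0))
```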

3. Add 3D Spatial Understanding

Align V-JEPA features with frozen VGGT-1B (3D-aware vision model) via cosine similarity loss. This adds implicit depth understanding to the 2D visual guidance, enabling precise whole-body control in 3D space without requiring depth sensors.
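
A minimal sketch of such an alignment loss, assuming both feature maps are brought to a shared dimension by a small projection head and matched token-by-token; the projection and the token correspondence are our assumptions.

```python
import torch
import torch.nn.functional as F

def spatial_forcing_loss(vjepa_tokens, vggt_tokens, proj):
    """Cosine-similarity alignment against frozen VGGT features.
    vjepa_tokens: (B, N, 1280) from the trainable encoder path.
    vggt_tokens:  (B, N, D) from the frozen VGGT-1B (no gradients).
    proj: small trainable head mapping 1280 -> D."""
    pred = proj(vjepa_tokens)
    target = vggt_tokens.detach()                 # keep VGGT frozen
    # Minimizing (1 - cosine similarity) pulls each predicted token toward
    # the corresponding 3D-aware VGGT token.
    return (1.0 - F.cosine_similarity(pred, target, dim=-1)).mean()
```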

4. Generate Whole-Body Motor Commands

ActionDiT (200M params) generates 30-step action chunks that follow the consistent visual guidance. By combining temporal coherence from world model predictions with spatial precision from VGGT alignment, it produces smooth whole-body movements coordinated across the full manipulation horizon.
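
As a schematic of this last step, the sampler below iteratively refines a noisy 30-step action chunk conditioned on the guidance features and proprioception. The `action_dit` interface, the action dimensionality, and the flow-matching-style Euler update are illustrative assumptions about a 200M diffusion-transformer head, not the paper's exact sampler.

```python
import torch

@torch.no_grad()
def generate_action_chunk(action_dit, guidance_tokens, proprio,
                          chunk_len=30, action_dim=32, steps=10):
    """Iteratively refine a noisy action chunk into 30 whole-body commands."""
    actions = torch.randn(1, chunk_len, action_dim)        # start from noise
    for t in torch.linspace(0.0, 1.0, steps):
        # Predict a velocity field conditioned on visual guidance and state,
        # then take one Euler step along it (flow-matching-style sampling).
        velocity = action_dit(actions, t=t,
                              visual_cond=guidance_tokens, state=proprio)
        actions = actions + velocity / steps
    return actions.squeeze(0)                              # (30, action_dim)
```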

Why This Enables Whole-Body Manipulation

Visual Consistency

Explicit pixel predictions maintain coherent appearance and motion across the full sequence, providing reliable guidance for complex movements

Real-Time Responsiveness

Asynchronous design decouples slow prediction (900ms) from fast control (30Hz), achieving both long-horizon consistency and immediate reactivity

Spatial-Temporal Understanding

VGGT alignment adds 3D awareness to temporally-coherent predictions, enabling whole-body coordination in physical space

Continuous Guidance Stream

Prediction reuse (~4x per sequence) ensures smooth transitions and consistent guidance throughout manipulation without gaps