HarmoWAM: Harmonizing Generalizable and Precise Manipulation via Adaptive World Action Models

Figure: Overview of HarmoWAM. We propose HarmoWAM, an end-to-end World Action Model (WAM) that jointly achieves generalizable transit and precise manipulation. Its world model provides physical dynamics priors and adaptively coordinates a predictive action expert and a reactive action expert. HarmoWAM achieves state-of-the-art performance in in-domain (ID) settings and holds a substantial advantage in out-of-domain (OOD) scenarios.

Qiuxuan Feng1* Jiale Yu1* Jiaming Liu1*† Yueru Jia1* Zhuangzhe Wu1 Hao Chen3 Zezhong Qian1 Shuo Gu2 Peng Jia2 Siwei Ma1 Shanghang Zhang1✉
1 State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University; 2 Simplexity Robotics; 3 The Chinese University of Hong Kong
* Equal contribution; † Project lead; ✉ Corresponding author
HarmoWAM overview video. Representative real-world demonstrations across in-domain and out-of-domain scenarios.

Headline Results

89%: average success rate on six real-world in-domain tasks
82%: global OOD average across background, position, and object shifts
+33%: OOD margin over prior state-of-the-art VLA models
48 Hz: action generation speed with action chunk size 12
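
To make the speed figure concrete, the short sketch below works out what 48 Hz implies for chunked inference. It assumes that 48 Hz counts individual actions per second; the page does not state this, and the helper name `chunk_budget` is purely illustrative.

```python
ACTION_RATE_HZ = 48  # headline figure; assumed to mean actions per second
CHUNK_SIZE = 12      # actions produced per inference call (headline figure)


def chunk_budget(action_rate_hz: float, chunk_size: int) -> tuple:
    """Return (chunks per second, seconds available to produce one chunk)."""
    chunks_per_second = action_rate_hz / chunk_size
    return chunks_per_second, 1.0 / chunks_per_second


rate, budget = chunk_budget(ACTION_RATE_HZ, CHUNK_SIZE)
print(f"{rate:.1f} chunks/s, {budget * 1000:.0f} ms per chunk")
# prints "4.0 chunks/s, 250 ms per chunk"
```

Under that reading, the policy would need to emit a new 12-action chunk roughly every 250 ms to sustain the reported rate.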

In-Domain Real-World Results

Task-level success rates for four single-arm and two dual-arm tasks. HarmoWAM achieves the best average result.

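| Method          | Pick Fruit | Stack Cans | Pour Coke | Write "Yes" | Put Flowers | Put Items | Average |
|-----------------|------------|------------|-----------|-------------|-------------|-----------|---------|
| π0.5            | 0.80       | 0.68       | 0.75      | 0.83        | 0.72        | 0.67      | 0.74    |
| VPP             | 0.80       | 0.60       | 0.78      | 0.73        | -           | -         | 0.73    |
| Wan+AnyPos      | 0.88       | 0.60       | 0.78      | 0.72        | 0.53        | 0.52      | 0.67    |
| QwenVLA-OFT     | 0.78       | 0.30       | 0.73      | 0.72        | -           | -         | 0.63    |
| Cosmos-Policy   | 0.93       | 0.65       | 0.80      | 0.83        | 0.75        | 0.72      | 0.78    |
| HarmoWAM (Ours) | 0.95       | 0.90       | 0.88      | 0.92        | 0.85        | 0.85      | 0.89    |

A dash marks a task with no listed result. VPP and QwenVLA-OFT list four values each, and their averages equal the mean of those four.

Generalizable Transit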

HarmoWAM uses world-model predictions to expand the exploration space, helping the robot reach target objects under unseen backgrounds, positions, and object instances.

Precise Manipulation

The predictive expert conditions on latent video dynamics, supporting temporally coherent actions for contact-rich interaction and bimanual coordination.

Adaptive Coordination

Process-Adaptive Gating routes control between reactive and predictive experts according to the current task stage, forming a unified closed-loop WAM policy.
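
The routing idea can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the page does not describe the gate's inputs, its output, or whether switching is hard or soft, so the class `GatedPolicy`, the scalar gate score, and the 0.5 threshold are hypothetical. The sketch also assumes the reactive expert covers transit and the predictive expert covers interaction.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Latent = List[float]  # stand-in for world-model features (hypothetical)
Action = List[float]  # stand-in for a robot action (hypothetical)


@dataclass
class GatedPolicy:
    reactive_expert: Callable[[Latent], Action]
    predictive_expert: Callable[[Latent], Action]
    gate: Callable[[Latent], float]  # hypothetical score in [0, 1]; high = interaction stage
    threshold: float = 0.5           # assumed hard switch; the actual gate may differ

    def act(self, latent: Latent) -> Tuple[str, Action]:
        """Pick one expert for this step and return its name and action."""
        if self.gate(latent) >= self.threshold:
            return "predictive", self.predictive_expert(latent)
        return "reactive", self.reactive_expert(latent)


# Toy usage: latent[0] plays the role of an "interaction likelihood".
policy = GatedPolicy(
    reactive_expert=lambda z: [0.0, 0.0],
    predictive_expert=lambda z: [1.0, 1.0],
    gate=lambda z: z[0],
)
print(policy.act([0.1]))  # gate below threshold, so the reactive expert acts
print(policy.act([0.9]))  # gate above threshold, so the predictive expert acts
```

Because the gate is re-evaluated at every step, control can move between the experts as the task progresses, which is the closed-loop behavior the paragraph describes.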

Generalization Performance

HarmoWAM maintains strong zero-shot performance across three OOD scenario types.

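| Method          | Background | Position | Objects | Global Avg | Drop from ID |
|-----------------|------------|----------|---------|------------|--------------|
| π0.5            | 0.60       | 0.32     | 0.54    | 0.49       | 33.8%↓       |
| VPP             | 0.43       | 0.23     | 0.57    | 0.41       | 43.8%↓       |
| Wan+AnyPos      | 0.53       | 0.49     | 0.58    | 0.53       | 20.9%↓       |
| QwenVLA-OFT     | 0.46       | 0.28     | 0.50    | 0.41       | 34.9%↓       |
| Cosmos-Policy   | 0.57       | 0.26     | 0.50    | 0.44       | 43.6%↓       |
| HarmoWAM (Ours) | 0.81       | 0.80     | 0.85    | 0.82       | 7.9%↓        |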
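
The page does not say how "Drop from ID" is computed. The listed values are consistent with a relative drop, (ID average - OOD average) / ID average, applied to the rounded averages from the two tables. The sketch below checks that reading; the function name `relative_drop` is illustrative and the formula is an inference, not a statement from the authors.

```python
# (in-domain average, global OOD average, reported drop in %), copied from the tables
reported = {
    "π0.5": (0.74, 0.49, 33.8),
    "VPP": (0.73, 0.41, 43.8),
    "Wan+AnyPos": (0.67, 0.53, 20.9),
    "QwenVLA-OFT": (0.63, 0.41, 34.9),
    "Cosmos-Policy": (0.78, 0.44, 43.6),
    "HarmoWAM": (0.89, 0.82, 7.9),
}


def relative_drop(id_avg: float, ood_avg: float) -> float:
    """Percentage of in-domain success lost under OOD shifts (assumed formula)."""
    return (id_avg - ood_avg) / id_avg * 100


for method, (id_avg, ood_avg, listed) in reported.items():
    computed = round(relative_drop(id_avg, ood_avg), 1)
    print(f"{method}: computed {computed}%, listed {listed}%")
```

For example, HarmoWAM gives (0.89 - 0.82) / 0.89 ≈ 7.9%, matching the last row of the table.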

Real-World Experiments

HarmoWAM is evaluated on six real-world manipulation tasks, including four single-arm tasks and two dual-arm collaborative tasks.

Task Videos

Each task is shown in four settings: original setting, unseen object, unseen background, and unseen position.

Pick Fruit to Plate (single-arm)
Stack Coke Cans (single-arm)
Pour Coke into Beaker (single-arm)
Write "Yes" (single-arm)
Put Flowers in Vase (dual-arm)
Put Items to Bag and Zip (dual-arm)

Abstract

World Action Models (WAMs) have emerged as a promising paradigm for robot control by modeling physical dynamics. Current WAMs generally follow two paradigms: the "Imagine-then-Execute" approach, which uses video prediction to infer actions via inverse dynamics, and the "Joint Modeling" approach, which jointly models actions and video representations.

Based on systematic experiments, we observe a fundamental trade-off between these paradigms: the former explicitly leverages world models for generalizable transit but lacks interaction precision, whereas the latter enables fine-grained, temporally coherent action generation but is constrained by the exploration space of the training distribution.

Motivated by these findings, we propose HarmoWAM, an end-to-end WAM that fully leverages a world model to unify predictive and reactive control, enabling both generalizable transit and precise manipulation. A Process-Adaptive Gating Mechanism automatically determines when and where to switch between the complementary experts. Across six real-world tasks and three test environments unseen during training, HarmoWAM significantly outperforms prior state-of-the-art VLA models and WAMs.

BibTeX

@misc{harmowam2026,
  title  = {HarmoWAM: Harmonizing Generalizable and Precise Manipulation via Adaptive World Action Models},
  author = {Feng, Qiuxuan and Yu, Jiale and Liu, Jiaming and Jia, Yueru and Wu, Zhuangzhe and Chen, Hao and Qian, Zezhong and Gu, Shuo and Jia, Peng and Ma, Siwei and Zhang, Shanghang},
  year   = {2026}
}