HarmoWAM: Harmonizing Generalizable and Precise Manipulation via Adaptive World Action Models

Headline Results

89% average success rate on six real-world in-domain tasks

82% global OOD average across background, position, and object shifts

+33% absolute OOD margin over prior state-of-the-art VLA models (global average 0.82 vs. 0.49 for π_0.5)

48 Hz action generation with an action chunk size of 12

In-Domain Real-World Results

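Task-level success rates for four single-arm and two dual-arm tasks. HarmoWAM achieves the highest success rate on every task and the best average (0.89). A dash marks a task with no reported result; for those methods, the average covers only the four tasks with reported results.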

Method	Pick Fruit	Stack Cans	Pour Coke	Write "Yes"	Put Flowers	Put Items	Average
π_0.5	0.80	0.68	0.75	0.83	0.72	0.67	0.74
VPP	0.80	0.60	0.78	0.73	—	—	0.73
Wan+AnyPos	0.88	0.60	0.78	0.72	0.53	0.52	0.67
QwenVLA-OFT	0.78	0.30	0.73	0.72	—	—	0.63
Cosmos-Policy	0.93	0.65	0.80	0.83	0.75	0.72	0.78
HarmoWAM (Ours)	0.95	0.90	0.88	0.92	0.85	0.85	0.89
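
The Average column can be reproduced from the per-task rates. The short sketch below transcribes the table and averages only the reported entries; treating a dash as "skip this task" is an assumption, though it matches the published averages. The names `in_domain` and `average_reported` are illustrative.

```python
# Success rates transcribed from the in-domain table above.
# None stands for a dash (no reported result); skipping it is an assumption.
in_domain = {
    "π_0.5":         [0.80, 0.68, 0.75, 0.83, 0.72, 0.67],
    "VPP":           [0.80, 0.60, 0.78, 0.73, None, None],
    "Wan+AnyPos":    [0.88, 0.60, 0.78, 0.72, 0.53, 0.52],
    "QwenVLA-OFT":   [0.78, 0.30, 0.73, 0.72, None, None],
    "Cosmos-Policy": [0.93, 0.65, 0.80, 0.83, 0.75, 0.72],
    "HarmoWAM":      [0.95, 0.90, 0.88, 0.92, 0.85, 0.85],
}


def average_reported(rates):
    """Mean over the tasks that have a reported success rate."""
    reported = [r for r in rates if r is not None]
    return sum(reported) / len(reported)


# Round to two decimals, as in the table.
averages = {method: round(average_reported(rates), 2)
            for method, rates in in_domain.items()}

for method, avg in averages.items():
    print(f"{method:14s} {avg:.2f}")
```

Because two methods are averaged over four tasks and the rest over six, their Average values are not strictly comparable with the others.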

Generalizable Transit

HarmoWAM uses world-model predictions to expand the exploration space, helping the robot reach target objects under unseen backgrounds, positions, and object instances.

Precise Manipulation

The predictive expert conditions on latent video dynamics, supporting temporally coherent actions for contact-rich interaction and bimanual coordination.

Adaptive Coordination

Process-Adaptive Gating routes control between reactive and predictive experts according to the current task stage, forming a unified closed-loop WAM policy.
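
One way to picture this routing is the toy sketch below. It is not the authors' implementation: `GatedPolicy`, the distance-based gate, the 0.05 threshold, and both stand-in experts are hypothetical, and assigning the reactive expert to the transit stage is an assumption made for illustration. It only shows the control flow of a hard switch between two experts driven by a stage signal.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

Action = List[float]
Expert = Callable[[Sequence[float]], Action]


@dataclass
class GatedPolicy:
    """Toy hard-switching policy; every name here is hypothetical."""

    reactive_expert: Expert
    predictive_expert: Expert
    gate: Callable[[Sequence[float]], float]  # score in [0, 1] for "manipulation stage"
    threshold: float = 0.5

    def act(self, observation: Sequence[float]) -> Tuple[str, Action]:
        # Route to the predictive expert once the gate score reaches the threshold.
        if self.gate(observation) >= self.threshold:
            return "predictive", self.predictive_expert(observation)
        return "reactive", self.reactive_expert(observation)


def distance_gate(observation: Sequence[float]) -> float:
    # Stand-in for a learned gate: "close to the target" counts as the manipulation stage.
    distance_to_target = observation[0]
    return 1.0 if distance_to_target < 0.05 else 0.0


policy = GatedPolicy(
    reactive_expert=lambda obs: [-min(obs[0], 0.10)],  # coarse step toward the target
    predictive_expert=lambda obs: [-0.5 * obs[0]],     # smaller step near contact
    gate=distance_gate,
)

far_choice, far_action = policy.act([0.40])    # far from the target
near_choice, near_action = policy.act([0.02])  # close to the target
print(far_choice, far_action)
print(near_choice, near_action)
```

In this sketch the gate is re-evaluated on every call to `act`, so the choice of expert can change at any step of the closed loop.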

Generalization Performance

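HarmoWAM maintains strong zero-shot performance across three OOD scenario types: unseen backgrounds, positions, and objects. Global Avg is the mean of the three scenario columns, and Drop from ID is the relative decrease from each method's in-domain average.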

Method	Background	Position	Objects	Global Avg	Drop from ID
π_0.5	0.60	0.32	0.54	0.49	33.8%↓
VPP	0.43	0.23	0.57	0.41	43.8%↓
Wan+AnyPos	0.53	0.49	0.58	0.53	20.9%↓
QwenVLA-OFT	0.46	0.28	0.50	0.41	34.9%↓
Cosmos-Policy	0.57	0.26	0.50	0.44	43.6%↓
HarmoWAM (Ours)	0.81	0.80	0.85	0.82	7.9%↓
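
The last two columns follow from the other numbers on this page. The sketch below recomputes them from the two-decimal values as published; using the rounded averages rather than unrounded ones is an assumption, though it reproduces the reported percentages. The names `id_avg`, `ood`, and `relative_drop` are illustrative.

```python
# In-domain averages and OOD scenario scores, transcribed from the two tables.
id_avg = {
    "π_0.5": 0.74, "VPP": 0.73, "Wan+AnyPos": 0.67,
    "QwenVLA-OFT": 0.63, "Cosmos-Policy": 0.78, "HarmoWAM": 0.89,
}
ood = {  # (background, position, objects)
    "π_0.5":         (0.60, 0.32, 0.54),
    "VPP":           (0.43, 0.23, 0.57),
    "Wan+AnyPos":    (0.53, 0.49, 0.58),
    "QwenVLA-OFT":   (0.46, 0.28, 0.50),
    "Cosmos-Policy": (0.57, 0.26, 0.50),
    "HarmoWAM":      (0.81, 0.80, 0.85),
}


def relative_drop(in_domain, out_of_domain):
    """Relative decrease from in-domain to OOD performance, in percent."""
    return (in_domain - out_of_domain) / in_domain * 100


# Global average is the mean of the three scenario scores, rounded as in the table.
global_avg = {m: round(sum(scores) / len(scores), 2) for m, scores in ood.items()}
drop = {m: round(relative_drop(id_avg[m], global_avg[m]), 1) for m in ood}

# Absolute margin over π_0.5, the VLA baseline with the highest global average here.
margin = round(global_avg["HarmoWAM"] - global_avg["π_0.5"], 2)

for m in ood:
    print(f"{m:14s} global={global_avg[m]:.2f} drop={drop[m]:.1f}%")
print(f"margin over π_0.5: {margin:.2f}")
```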

Real-World Experiments

HarmoWAM is evaluated on six real-world manipulation tasks: four single-arm tasks and two dual-arm collaborative tasks.


Abstract

World Action Models (WAMs) have emerged as a promising paradigm for robot control by modeling physical dynamics. Current WAMs generally follow two paradigms: the "Imagine-then-Execute" approach, which uses video prediction to infer actions via inverse dynamics, and the "Joint Modeling" approach, which jointly models actions and video representations.

Based on systematic experiments, we observe a fundamental trade-off between these paradigms: the former explicitly leverages world models for generalizable transit but lacks interaction precision, whereas the latter enables fine-grained, temporally coherent action generation but is constrained by the exploration space of the training distribution.

Motivated by these findings, we propose HarmoWAM, an end-to-end WAM that fully leverages a world model to unify predictive and reactive control, enabling both generalizable transit and precise manipulation. A Process-Adaptive Gating Mechanism automatically determines when and where to switch between the complementary experts. Across six real-world tasks and three test environments unseen during training, HarmoWAM significantly outperforms prior state-of-the-art VLA models and WAMs.

BibTeX

@misc{harmowam2026,
  title  = {HarmoWAM: Harmonizing Generalizable and Precise Manipulation via Adaptive World Action Models},
  author = {Feng, Qiuxuan and Yu, Jiale and Liu, Jiaming and Jia, Yueru and Wu, Zhuangzhe and Chen, Hao and Qian, Zezhong and Gu, Shuo and Jia, Peng and Ma, Siwei and Zhang, Shanghang},
  year   = {2026}
}