VoLo: A Physical Orchestrator for Open-Vocabulary Long-Horizon Manipulation

Siyi Chen1,2,* Hugo Hadfield1 Alex Zook1 Mikaela Angelina Uy1 Chan Hee Song1 Erwin Coumans1 Xuning Yang1 Faisal Ladhak1 Qing Qu2 Stan Birchfield1 Jonathan Tremblay1,† Valts Blukis1,†
1NVIDIA 2University of Michigan †Project Leads *Work done during an internship at NVIDIA

TL;DR: We frame open-vocabulary long-horizon manipulation as physical orchestration and present VoLoAgent, a VLM agent that plans, monitors, and recovers by steering a VLA/WAM as an interruptible tool alongside perception models and grasp/place primitives. We also introduce RoboVoLo, a high-fidelity benchmark of 126 tasks across common sense, memory, complex references, and world knowledge, with task-level success and failure-mode diagnostics.

VoLo overview

VoLo overview. VoLoAgent plans, monitors (e.g., subgoal complete), and uses tools (e.g., VLA, SAM3) to act and recover from failures (e.g., wrong object). RoboVoLo is a high-fidelity benchmark for evaluating and diagnosing open-vocabulary long-horizon manipulation.

Abstract

Open-vocabulary long-horizon manipulation requires robots to reason over flexible instructions and complex multi-object scenes while adaptively planning, executing, monitoring, and recovering from failures. We address these demands with a closed agent loop in which a VLM orchestrates heterogeneous robot capabilities as interruptible tools. Unlike in virtual AI agents, the timing of decisions, actions, and tool calls matters in a physical world that does not pause for reasoning. We refer to this setting as Physical Orchestration and propose VoLoAgent, a VLM agent that plans, monitors, and recovers by treating a VLA/WAM as an interruptible tool it steers mid-rollout alongside vision models and action primitives. To evaluate these long-horizon capabilities, we introduce RoboVoLo, a high-fidelity benchmark for open-vocabulary long-horizon manipulation across common sense, memory/state tracking, complex references, and world knowledge, with both task-level success and failure-mode diagnostics. Experiments show that VoLoAgent substantially outperforms single VLA/VLM and tool-based systems, and real-robot experiments validate these results.

RoboVoLo Benchmark

RoboVoLo benchmark taxonomy

RoboVoLo benchmark. 126 long-horizon manipulation tasks across 15 categories, grouped into four capability suites: Common Sense (infer intent from scene context), Memory (track state across actions), Complex References (resolve spatial, ordinal, size, and negation cues), and World Knowledge (apply external knowledge spanning math, art, chemistry, and recycling). Built on RoboLab and NVIDIA Isaac Lab, expanded with 501 new objects.

VoLoAgent & Physical Orchestration

Physical Orchestration

Virtual AI agents assume a world that holds still while the agent thinks, whereas a physical agent must reason while the world keeps moving. This imposes a core monitor–halt–redirect requirement: the agent must monitor for divergence between what it believes it accomplished and the actual scene, halt an in-flight action as quickly as possible when divergence is detected, and redirect by replanning, reissuing the action, or switching tools.
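The monitor–halt–redirect requirement can be sketched as a single decision function. This is a minimal illustration, not the authors' implementation: the `Belief` and `Scene` types, the `tick` function, and the rule for choosing between replanning, reissuing, and switching tools are all assumptions made for the example.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Redirect(Enum):
    REPLAN = "replan"            # the subgoal itself no longer makes sense
    REISSUE = "reissue"          # retry the same action with the same tool
    SWITCH_TOOL = "switch_tool"  # hand the subgoal to a different tool


@dataclass(frozen=True)
class Belief:
    held_object: Optional[str]   # what the agent thinks is in the gripper


@dataclass(frozen=True)
class Scene:
    held_object: Optional[str]   # what perception actually reports
    subgoal_feasible: bool       # whether the active subgoal can still succeed


def diverged(belief: Belief, scene: Scene) -> bool:
    """Monitor: does the believed state disagree with the observed scene?"""
    return belief.held_object != scene.held_object or not scene.subgoal_feasible


def tick(belief: Belief, scene: Scene, retries: int,
         max_retries: int = 1) -> Optional[Redirect]:
    """One monitoring tick (hypothetical policy).

    Returns None to keep executing. Any other value means the in-flight
    action should be halted and the returned redirect applied.
    """
    if not diverged(belief, scene):
        return None
    if not scene.subgoal_feasible:
        return Redirect.REPLAN
    if retries < max_retries:
        return Redirect.REISSUE
    return Redirect.SWITCH_TOOL


# The agent believes it holds the mug, but perception reports a bowl.
print(tick(Belief("mug"), Scene("bowl", True), retries=0))
```

The retry budget is only one possible way to decide when reissuing stops being worthwhile; a real agent would presumably let the VLM make that call from the scene.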

The VoLoAgent System

VoLoAgent system diagram

VoLoAgent system. A single VLM agent plans, monitors, and orchestrates tools (VLA/WAM rollouts, perception models, grasp/place primitives) through one closed-loop control law. The agent can interrupt a VLA rollout and switch to a different tool when execution drifts.

Unlike prior hierarchical systems that split control between a VLM planner and a VLA executor, here the VLA is one callable tool alongside perception models and grasp/place primitives. VoLoAgent realizes the monitor–halt–redirect loop through three design choices:

  • Asynchronous tools. Robot motion runs independent of the agent's reasoning, so the agent interleaves monitoring with execution rather than blocking.
  • Fast and slow memory. A short monitor context (current observation, active subgoal, recent decisions) is read close to the motion timescale (0.2 Hz), while a fuller deliberation context (task memory, scene history, tool catalog) is consulted only at planning points.
  • Safety-aware idling. The robot is held still whenever reasoning must continue mid-task.
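The three design choices above can be sketched with Python's `asyncio`. This is an illustrative toy under stated assumptions, not the VoLoAgent code: `vla_rollout`, `orchestrate`, `FastContext`, and `SlowContext` are hypothetical names, observations are plain strings, and the monitoring period is far shorter than 0.2 Hz so the example runs quickly.

```python
import asyncio
from dataclasses import dataclass, field


@dataclass
class FastContext:
    """Short monitor context, read on every monitoring tick."""
    observation: str = ""
    subgoal: str = "pick mug"
    recent_decisions: list = field(default_factory=list)


@dataclass
class SlowContext:
    """Fuller deliberation context, consulted only at planning points."""
    task_memory: list = field(default_factory=list)
    tool_catalog: tuple = ("vla", "grasp", "place")


async def vla_rollout(log, steps=50, dt=0.01):
    """Stand-in for a long-running rollout that executes on its own."""
    try:
        for i in range(steps):
            log.append(f"vla step {i}")
            await asyncio.sleep(dt)
        return "finished"
    except asyncio.CancelledError:
        # Safety-aware idling: stop moving before any further reasoning.
        log.append("hold still")
        raise


async def orchestrate(observations, period=0.02, steps=50):
    log = []
    fast, slow = FastContext(), SlowContext()
    # Asynchronous tool: the rollout runs while the agent keeps monitoring.
    rollout = asyncio.create_task(vla_rollout(log, steps=steps))
    for obs in observations:
        await asyncio.sleep(period)
        if rollout.done():
            break
        fast.observation = obs
        if obs != "on track":
            rollout.cancel()                      # halt the in-flight action
            try:
                await rollout
            except asyncio.CancelledError:
                pass
            # Planning point: only now is the slow context touched.
            slow.task_memory.append(f"halted: {obs}")
            next_tool = "grasp" if "grasp" in slow.tool_catalog else "vla"
            fast.recent_decisions.append(next_tool)
            return next_tool, log
    return await rollout, log


tool, log = asyncio.run(orchestrate(["on track", "wrong object"]))
print(tool, log[-1])
```

Cancelling the task models the interrupt; on a real robot the halt would have to reach the controller and bring the arm to a safe stop, which a cancelled coroutine alone does not guarantee.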

A key emergent property is complementarity: action primitives inject perception grounding into the VLA, so even a failed grasp leaves the gripper near the target with a clean view for the VLA to finish the pick.

Main Results

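Column groups: single action models (π0.5, π0-FAST, MolmoBot, MolmoAct2, DreamZero); code-as-policy (CaP-X-s, CaP-X-e); TAMP (TiPToP); VoLoAgent, ours (No VLA, Only VLA, Full).

| Suite | Category | π0.5 | π0-FAST | MolmoBot | MolmoAct2 | DreamZero | CaP-X-s | CaP-X-e | TiPToP (TAMP) | VoLoAgent: No VLA | VoLoAgent: Only VLA | VoLoAgent: Full |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Common Sense | Infer | 0.00 | 9.52 | 14.29 | 0.00 | 19.05 | 9.52 | 14.29 | 4.76 | 19.05 | 52.38 | 52.38 |
| Common Sense | Kit | 16.67 | 4.17 | 0.00 | 0.00 | 12.50 | 12.50 | 16.67 | 8.33 | 41.67 | 33.33 | 50.00 |
| Common Sense | Recover | 4.17 | 0.00 | 12.50 | 12.50 | 20.83 | 37.50 | 29.17 | 0.00 | 62.50 | 45.83 | 62.50 |
| Common Sense | Sort | 23.81 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 47.62 | 52.38 |
| Common Sense | Overall | 11.11 | 3.33 | 6.67 | 3.33 | 13.33 | 15.56 | 15.56 | 3.33 | 32.22 | 44.44 | 54.44 |
| Memory | Order | 12.50 | 25.00 | 33.33 | 25.00 | 29.17 | 16.67 | 16.67 | 0.00 | 25.00 | 29.17 | 54.17 |
| Memory | Recall | 23.33 | 3.33 | 30.00 | 3.33 | 21.43 | 23.33 | 23.33 | 3.33 | 6.67 | 63.33 | 56.67 |
| Memory | Swap | 3.33 | 0.00 | 6.67 | 3.33 | 0.00 | 6.67 | 6.67 | 0.00 | 10.00 | 10.00 | 3.33 |
| Memory | Overall | 13.10 | 8.33 | 22.62 | 9.52 | 15.85 | 15.48 | 15.48 | 1.19 | 13.10 | 34.52 | 36.90 |
| Complex References | Spatial | 14.81 | 11.11 | 0.00 | 7.41 | 11.11 | 7.41 | 7.41 | 25.93 | 7.41 | 29.63 | 40.74 |
| Complex References | Counting | 16.67 | 12.50 | 12.50 | 0.00 | 0.00 | 4.17 | 4.17 | 12.50 | 4.17 | 45.83 | 54.17 |
| Complex References | Negation | 16.67 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 20.83 | 25.00 | 45.83 | 54.17 |
| Complex References | Size+Sort | 19.05 | 4.76 | 9.52 | 0.00 | 4.76 | 19.05 | 19.05 | 23.81 | 0.00 | 42.86 | 57.14 |
| Complex References | Overall | 16.67 | 7.29 | 5.21 | 2.08 | 4.17 | 7.29 | 7.29 | 20.83 | 9.38 | 40.62 | 51.04 |
| World Knowledge | Art | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 16.67 | 16.67 | 4.17 | 8.33 |
| World Knowledge | Chem | 8.33 | 0.00 | 12.50 | 4.17 | 12.50 | 4.17 | 4.17 | 50.00 | 29.17 | 41.67 | 54.17 |
| World Knowledge | Math | 4.17 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 4.17 | 20.83 | 0.00 | 12.50 |
| World Knowledge | Recycle | 25.00 | 0.00 | 4.17 | 0.00 | 0.00 | 4.17 | 4.17 | 20.83 | 0.00 | 37.50 | 25.00 |
| World Knowledge | Overall | 9.38 | 0.00 | 4.17 | 1.04 | 3.12 | 2.08 | 2.08 | 22.92 | 16.67 | 20.83 | 25.00 |
| Robolab-Vague | Easy | 19.79 | 10.94 | 13.76 | 6.25 | 19.79 | 16.67 | 15.10 | 29.69 | 19.79 | 35.94 | 34.90 |
| Robolab-Vague | Med | 17.54 | 11.40 | 11.40 | 6.14 | 18.80 | 14.04 | 9.65 | 7.02 | 16.67 | 26.32 | 30.70 |
| Robolab-Vague | Hard | 5.56 | 3.70 | 3.77 | 0.00 | 13.73 | 7.41 | 1.85 | 5.56 | 12.96 | 16.67 | 24.07 |
| Robolab-Vague | Overall | 16.94 | 10.00 | 11.52 | 5.28 | 18.61 | 14.44 | 11.39 | 18.89 | 17.78 | 30.00 | 31.94 |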

Results of various methods on RoboVoLo and the Robolab-Vague benchmark. All values are success rate (%, higher is better). Each task is run for 3 episodes. Bold = best in row; underline = second-best. VoLoAgent (Full) achieves the best long-horizon open-vocabulary manipulation performance, outperforming single-model, code-as-policy, and TAMP baselines on every suite.
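To make the metric concrete, the snippet below sketches how a success rate could be computed when every task is run for 3 episodes. The task names, the 8-task category, and the choice to pool episodes across tasks are assumptions for illustration; the page does not state how the Overall rows are aggregated.

```python
def success_rate(outcomes):
    """Percentage of successful episodes, rounded to two decimals."""
    outcomes = list(outcomes)
    return round(100.0 * sum(outcomes) / len(outcomes), 2)


def pooled_rate(tasks):
    """Pool every episode of every task, then take one success rate."""
    return success_rate(o for episodes in tasks.values() for o in episodes)


# Hypothetical category: 8 tasks, 3 episodes each, one success in total.
tasks = {f"task_{i}": [False, False, False] for i in range(8)}
tasks["task_0"] = [True, False, False]

print(pooled_rate(tasks))  # 4.17
```

With 24 episodes in such a category, every reported value would be a multiple of 1/24, i.e. about 4.17 percentage points.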

Process Comparison

Process comparison across systems

Process comparison on two open-vocabulary long-horizon tasks, one row per system. Red tags mark failure events; green tags mark grasp-tool recovery events. When the VLA selects the wrong object, the grasp tool repositions the gripper on the correct target, and the VLA completes the contact-rich manipulation.

Failure Mode Analysis

World failure analysis

World failure analysis tracing episodes through failures, recovery, and outcomes for π0.5 (left) and VoLoAgent (right). VoLoAgent has 5× as many failure-free episodes (20 vs. 4) and recovers from 54% (38/70) of failures vs. only 13% (11/86) for π0.5. Major failure subtypes: stuck, WOP = wrong object picked, WTP = wrong target place.
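The headline figures in this caption follow directly from the stated counts. A short check, where the helper name `percent` is introduced only for this example:

```python
def percent(numerator: int, denominator: int) -> int:
    """Share expressed as a whole-number percentage."""
    return round(100 * numerator / denominator)


volo_recovery = percent(38, 70)   # failures recovered by VoLoAgent
pi05_recovery = percent(11, 86)   # failures recovered by π0.5
failure_free_ratio = 20 / 4       # failure-free episodes, VoLoAgent vs. π0.5

print(volo_recovery, pi05_recovery, failure_free_ratio)  # 54 13 5.0
```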

VLM failure audit

VLM failure audit. Left: one example per failure type (Planning, Completion-monitor, Failure-monitor, Tool-use). Right: per-VLM error counts across n=90 episodes. Completion-monitor errors dominate every backend; Claude Opus 4.6 reaches only 5% of the ceiling error count vs. 23% for Qwen3-VL-8B.

Real-World Robot Validation

Real robot examples

Real robot examples. Deployed on a real Franka FR3, VoLoAgent achieves 42.9% success vs. 14.3% for π0.5 across 14 RoboVoLo tasks (3 trials each), a 3× improvement. The agent also monitors and recovers from real-world failures such as placing at the wrong destination and picking the wrong object.

Citation

@article{chen2026volo,
    title   = {VoLo: A Physical Orchestrator for Open-Vocabulary Long-Horizon Manipulation},
    author  = {Chen, Siyi and Hadfield, Hugo and Zook, Alex and Uy, Mikaela Angelina and
               Song, Chan Hee and Coumans, Erwin and Yang, Xuning and Ladhak, Faisal and
               Qu, Qing and Birchfield, Stan and Tremblay, Jonathan and Blukis, Valts},
    journal = {arXiv preprint},
    year    = {2026}
}