Qingyu Fan1,2, Zhaoxiang Li2, Yi Lu2, Wang Chen2, Qiu Shen2, Xiao-xiao Long2,†, Yinghao Cai1,†, Tao Lu1, Shuo Wang1, Xun Cao1
1Institute of Automation, Chinese Academy of Sciences 2Nanjing University
†Corresponding Authors
This repository is the official implementation of PEAfowl. We are currently preparing the code and data for release. Please stay tuned!
- Release the Code (Training scripts).
- Release Pre-trained Models.
- Release Evaluation Scripts (RoboTwin2.0).
- Release Real-Robot Control Code.
Bimanual manipulation in cluttered scenes requires policies that remain stable under occlusions, viewpoint changes, and scene variations. Existing vision-language-action models often fail to generalize because (i) multi-view features are fused via view-agnostic token concatenation, yielding weak 3D-consistent spatial understanding, and (ii) language is injected only as global conditioning, resulting in coarse instruction grounding.
In this paper, we introduce PEAfowl, a perception-enhanced multi-view VLA policy for bimanual manipulation. For spatial reasoning, PEAfowl predicts per-token depth distributions, performs differentiable 3D lifting, and aggregates local cross-view neighbors to form geometrically grounded, cross-view consistent representations. For instruction grounding, we replace global conditioning with a Perceiver-style text-aware readout over frozen CLIP visual features, enabling iterative evidence accumulation. To overcome noisy and incomplete commodity depth without adding inference overhead, we apply training-only depth distillation from a pretrained depth teacher to supervise the depth-distribution head, providing the perception front-end with geometry-aware priors.
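The training-only depth distillation can be pictured as a cross-entropy between the student's per-token depth distribution and soft bin targets derived from the teacher's metric depth. The snippet below is a minimal numpy sketch under our own assumptions (bin layout, soft-target temperature, and the helper names `depth_distill_loss`, `bin_centers` are illustrative, not the released implementation):

```python
import numpy as np

def log_softmax(x, axis=-1):
    # numerically stable log-softmax
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def depth_distill_loss(student_logits, teacher_depth, bin_centers, temp=0.1):
    """Cross-entropy between the student's per-token depth distribution
    and soft targets built from the teacher's metric depth.

    student_logits : (N, B) logits over B depth bins, one row per token
    teacher_depth  : (N,)   teacher metric depth per token
    bin_centers    : (B,)   metric depth of each bin center
    """
    # soft target: bins closer to the teacher depth receive more mass
    dist = -np.abs(teacher_depth[:, None] - bin_centers[None, :]) / temp
    target = np.exp(log_softmax(dist))               # (N, B)
    log_p = log_softmax(student_logits)              # (N, B)
    return float(-(target * log_p).sum(axis=-1).mean())
```

Because the loss only supervises the depth-distribution head during training, the teacher (and this loss) can be dropped entirely at inference time.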
On RoboTwin 2.0 under the domain-randomized setting, PEAfowl improves over the strongest baseline by 23.0 pp in success rate, and real-robot experiments further demonstrate reliable sim-to-real transfer and consistent gains from depth distillation.
Figure 1: Motivation and overview of PEAfowl. (a) Prior bimanual VLAs typically concatenate per-view visual tokens and apply global text conditioning, without explicit cross-view geometric alignment or instruction-relevant visual evidence retrieval and aggregation. (b) We propose PEAfowl, which incorporates geometry-guided multi-view fusion and a Perceiver-style text-as-query readout over frozen CLIP features. Bottom: Average success rates on RoboTwin 2.0 (nine training tasks) under Clean and Domain-Randomized settings, comparing multi-task bimanual baselines.
We introduce PEAfowl, a multi-view vision-language-action model with geometry- and language-guided perception for bimanual manipulation. On the spatial side, we propose a geometry-driven multi-view fusion module that (i) performs per-patch RGB-D token fusion, (ii) predicts a depth distribution for each visual token, and (iii) backprojects tokens into a shared 3D base frame to perform local 3D neighborhood aggregation across cameras. This design explicitly models cross-view geometric correspondences and endows 2D visual features with depth-aware, 3D-consistent structure.
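Steps (ii) and (iii) above, predicting a depth distribution per token and backprojecting tokens into a shared base frame, can be sketched in a few lines. This is a minimal numpy illustration under the standard pinhole camera model; the function and argument names (`lift_tokens_to_base`, `bin_centers`, etc.) are our own, not the paper's API:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def lift_tokens_to_base(depth_logits, pix_uv, K, T_cam_to_base, bin_centers):
    """Differentiably lift visual tokens into a shared 3D base frame.

    depth_logits  : (N, B) per-token logits over B depth bins
    pix_uv        : (N, 2) patch-center pixel coordinates (u, v)
    K             : (3, 3) camera intrinsics
    T_cam_to_base : (4, 4) camera-to-base extrinsics
    bin_centers   : (B,)   metric depth of each bin center
    returns       : (N, 3) expected 3D token positions in the base frame
    """
    p = softmax(depth_logits, axis=-1)        # per-token depth distribution
    z = p @ bin_centers                       # expected depth, (N,)
    uv1 = np.concatenate([pix_uv, np.ones((len(pix_uv), 1))], axis=1)
    rays = uv1 @ np.linalg.inv(K).T           # back-project pixels to rays
    pts_cam = rays * z[:, None]               # scale rays by expected depth
    pts_h = np.concatenate([pts_cam, np.ones((len(pts_cam), 1))], axis=1)
    return (pts_h @ T_cam_to_base.T)[:, :3]   # transform into base frame
```

Once all views' tokens live in the same base frame, local 3D neighborhood aggregation reduces to gathering each token's nearest cross-view neighbors in that shared coordinate system.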
On the language side, PEAfowl builds upon OTTER-style text-aware visual extraction and replaces global text conditioning with a Perceiver-style text-aware transformer. Text tokens act as latent queries that iteratively cross-attend to per-view patch features, producing a compact set of language-conditioned visual tokens. This mechanism sharpens attention on task-relevant objects and spatial relations, yielding representations well suited to policy learning.
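The text-as-query readout amounts to repeated cross-attention from text tokens onto frozen patch features. Below is a minimal single-head numpy sketch (no learned projections, illustrative names like `text_query_readout`); the real module would use multi-head attention with trained weights:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def text_query_readout(text_q, patch_kv, n_iters=2):
    """Text tokens cross-attend to frozen visual patch features.

    text_q   : (T, D) text-token queries
    patch_kv : (P, D) per-view patch features, used as keys and values
    returns  : (T, D) language-conditioned visual tokens
    """
    d = text_q.shape[-1]
    q = text_q
    for _ in range(n_iters):  # iterative evidence accumulation
        attn = softmax(q @ patch_kv.T / np.sqrt(d), axis=-1)  # (T, P)
        q = q + attn @ patch_kv       # residual cross-attention update
    return q
```

The output is a compact set of T language-conditioned tokens, independent of the (much larger) number of patches P, which keeps the downstream policy input small.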
Figure 2: PEAfowl architecture. PEAfowl couples geometry-guided multi-view fusion with language-guided readout to condition a SEM-style diffusion action decoder. Top: RGB–D tokens are used to predict per-token depth distributions for differentiable 3D lifting and cross-view fusion; a pretrained camera depth model supervises the depth-distribution head during training only. Bottom: Frozen CLIP features are queried by a Perceiver-style text-as-query readout and pooled into compact context tokens.
Note: For more visualizations and real-world robot demos, please visit our Project Page.
Both simulation and physical experiments demonstrate that PEAfowl, despite having only 300M trainable parameters, outperforms existing bimanual VLA models and visuomotor baselines, and generalizes well to novel scene appearances, workspace geometries, and language instructions. In simulation, we evaluate on