Qingyu Fan1,2, Zhaoxiang Li2, Yi Lu2, Wang Chen2, Qiu Shen2, Xiao-xiao Long2,†, Yinghao Cai1,†, Tao Lu1, Shuo Wang1, Xun Cao1
1Institute of Automation, Chinese Academy of Sciences 2Nanjing University
†Corresponding Authors
This repository is the official implementation of PEAfowl. We are currently preparing the code and data for release. Please stay tuned!
- Release the Code (Training scripts).
- Release Pre-trained Models.
- Release Evaluation Scripts (RoboTwin2.0).
- Release Real-Robot Control Code.
Bimanual manipulation in cluttered scenes requires policies that remain stable under occlusions, viewpoint changes, and scene variations. Existing vision-language-action models often fail to generalize because (i) multi-view features are fused via view-agnostic token concatenation, yielding weak 3D-consistent spatial understanding, and (ii) language is injected only as global conditioning, resulting in coarse instruction grounding.
In this paper, we introduce PEAfowl, a perception-enhanced multi-view VLA policy for bimanual manipulation. For spatial reasoning, PEAfowl predicts per-token depth distributions, performs differentiable 3D lifting, and aggregates local cross-view neighbors to form geometrically grounded, cross-view consistent representations. For instruction grounding, we replace global conditioning with a Perceiver-style text-aware readout over frozen CLIP visual features, enabling iterative evidence accumulation. To overcome noisy and incomplete commodity depth without adding inference overhead, we apply training-only depth distillation from a pretrained depth teacher to supervise the depth-distribution head, providing the perception front-end with geometry-aware priors.
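The training-only depth distillation can be pictured as a cross-entropy between the student's per-token depth distribution and soft bin targets derived from the teacher's metric depth. The snippet below is a minimal numpy sketch under our own assumptions (bin layout, soft-target temperature, and the helper names `depth_distill_loss`, `bin_centers` are illustrative, not the released implementation):

```python
import numpy as np

def log_softmax(x, axis=-1):
    # numerically stable log-softmax
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def depth_distill_loss(student_logits, teacher_depth, bin_centers, temp=0.1):
    """Cross-entropy between the student's per-token depth distribution
    and soft targets built from the teacher's metric depth.

    student_logits : (N, B) logits over B depth bins, one row per token
    teacher_depth  : (N,)   teacher metric depth per token
    bin_centers    : (B,)   metric depth of each bin center
    """
    # soft target: bins closer to the teacher depth receive more mass
    dist = -np.abs(teacher_depth[:, None] - bin_centers[None, :]) / temp
    target = np.exp(log_softmax(dist))               # (N, B)
    log_p = log_softmax(student_logits)              # (N, B)
    return float(-(target * log_p).sum(axis=-1).mean())
```

Because the loss only supervises the depth-distribution head during training, the teacher (and this loss) can be dropped entirely at inference time.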
On RoboTwin 2.0 under the domain-randomized setting, PEAfowl improves over the strongest baseline by 23.0 pp in success rate, and real-robot experiments further demonstrate reliable sim-to-real transfer and consistent gains from depth distillation.
Figure 1: Motivation and overview of PEAfowl. (a) Prior bimanual VLAs typically concatenate per-view visual tokens and apply global text conditioning, without explicit cross-view geometric alignment or instruction-relevant visual evidence retrieval and aggregation. (b) We propose PEAfowl, which incorporates geometry-guided multi-view fusion and a Perceiver-style text-as-query readout over frozen CLIP features. Bottom: Average success rates on RoboTwin 2.0 (nine training tasks) under Clean and Domain-Randomized settings, comparing multi-task bimanual baselines.
We introduce PEAfowl, a multi-view vision-language-action model with geometry- and language-guided perception for bimanual manipulation. On the spatial side, we propose a geometry-driven multi-view fusion module that (i) performs per-patch RGB-D token fusion, (ii) predicts a depth distribution for each visual token, and (iii) backprojects tokens into a shared 3D base frame to perform local 3D neighborhood aggregation across cameras. This design explicitly models cross-view geometric correspondences and endows 2D visual features with depth-aware, 3D-consistent structure.
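Steps (ii) and (iii) above, predicting a depth distribution per token and backprojecting tokens into a shared base frame, can be sketched in a few lines. This is a minimal numpy illustration under the standard pinhole camera model; the function and argument names (`lift_tokens_to_base`, `bin_centers`, etc.) are our own, not the paper's API:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def lift_tokens_to_base(depth_logits, pix_uv, K, T_cam_to_base, bin_centers):
    """Differentiably lift visual tokens into a shared 3D base frame.

    depth_logits  : (N, B) per-token logits over B depth bins
    pix_uv        : (N, 2) patch-center pixel coordinates (u, v)
    K             : (3, 3) camera intrinsics
    T_cam_to_base : (4, 4) camera-to-base extrinsics
    bin_centers   : (B,)   metric depth of each bin center
    returns       : (N, 3) expected 3D token positions in the base frame
    """
    p = softmax(depth_logits, axis=-1)        # per-token depth distribution
    z = p @ bin_centers                       # expected depth, (N,)
    uv1 = np.concatenate([pix_uv, np.ones((len(pix_uv), 1))], axis=1)
    rays = uv1 @ np.linalg.inv(K).T           # back-project pixels to rays
    pts_cam = rays * z[:, None]               # scale rays by expected depth
    pts_h = np.concatenate([pts_cam, np.ones((len(pts_cam), 1))], axis=1)
    return (pts_h @ T_cam_to_base.T)[:, :3]   # transform into base frame
```

Once all views' tokens live in the same base frame, local 3D neighborhood aggregation reduces to gathering each token's nearest cross-view neighbors in that shared coordinate system.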
On the language side, PEAfowl builds upon OTTER-style text-aware visual extraction and replaces global text conditioning with a Perceiver-style text-aware transformer. Text tokens act as latent queries that iteratively cross-attend to per-view patch features, producing a compact set of language-conditioned visual tokens. This mechanism sharpens attention on task-relevant objects and spatial relations, yielding representations well suited to policy learning.
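The text-as-query readout amounts to repeated cross-attention from text tokens onto frozen patch features. Below is a minimal single-head numpy sketch (no learned projections, illustrative names like `text_query_readout`); the real module would use multi-head attention with trained weights:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def text_query_readout(text_q, patch_kv, n_iters=2):
    """Text tokens cross-attend to frozen visual patch features.

    text_q   : (T, D) text-token queries
    patch_kv : (P, D) per-view patch features, used as keys and values
    returns  : (T, D) language-conditioned visual tokens
    """
    d = text_q.shape[-1]
    q = text_q
    for _ in range(n_iters):  # iterative evidence accumulation
        attn = softmax(q @ patch_kv.T / np.sqrt(d), axis=-1)  # (T, P)
        q = q + attn @ patch_kv       # residual cross-attention update
    return q
```

The output is a compact set of T language-conditioned tokens, independent of the (much larger) number of patches P, which keeps the downstream policy input small.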
Figure 2: PEAfowl architecture. PEAfowl couples geometry-guided multi-view fusion with language-guided readout to condition a SEM-style diffusion action decoder. Top: RGB–D tokens are used to predict per-token depth distributions for differentiable 3D lifting and cross-view fusion; a pretrained camera depth model supervises the depth-distribution head during training only. Bottom: Frozen CLIP features are queried by a Perceiver-style text-as-query readout and pooled into compact context tokens.
Note: For more visualizations and real-world robot demos, please visit our Project Page.
Both simulation and physical experiments demonstrate that PEAfowl, despite having only 300M trainable parameters, outperforms existing bimanual VLA models and visuomotor baselines, and generalizes well to novel scene appearances, workspace geometries, and language instructions. In simulation, we evaluate on