🦚PEAfowl: Perception-Enhanced Multi-View Vision-Language-Action for Bimanual Manipulation

Qingyu Fan1,2, Zhaoxiang Li2, Yi Lu2, Wang Chen2, Qiu Shen2, Xiao-xiao Long2,†, Yinghao Cai1,†, Tao Lu1, Shuo Wang1, Xun Cao1

1Institute of Automation, Chinese Academy of Sciences 2Nanjing University

†Corresponding Authors

Paper | Project Page | Video


📢 News & Roadmap

This repository is the official implementation of PEAfowl. We are currently preparing the code and data for release. Please stay tuned!

  • Release the Code (Training scripts).
  • Release Pre-trained Models.
  • Release Evaluation Scripts (RoboTwin2.0).
  • Release Real-Robot Control Code.

📖 Abstract

Bimanual manipulation in cluttered scenes requires policies that remain stable under occlusions, viewpoint changes, and scene variations. Existing vision-language-action models often fail to generalize because (i) multi-view features are fused via view-agnostic token concatenation, yielding weak 3D-consistent spatial understanding, and (ii) language is injected as global conditioning, resulting in coarse instruction grounding.

In this paper, we introduce PEAfowl, a perception-enhanced multi-view VLA policy for bimanual manipulation. For spatial reasoning, PEAfowl predicts per-token depth distributions, performs differentiable 3D lifting, and aggregates local cross-view neighbors to form geometrically grounded, cross-view consistent representations. For instruction grounding, we propose to replace global conditioning with a Perceiver-style text-aware readout over frozen CLIP visual features, enabling iterative evidence accumulation. To overcome noisy and incomplete commodity depth without adding inference overhead, we apply training-only depth distillation from a pretrained depth teacher to supervise the depth-distribution head, providing the perception front-end with geometry-aware priors.
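As a concrete illustration of the training-only depth distillation mentioned above, the sketch below shows one plausible way a per-token depth-distribution head could be supervised with depths from a pretrained teacher. The function name, soft-bin target construction, and temperature are assumptions made for illustration; they are not the released implementation.

```python
import torch
import torch.nn.functional as F

def depth_distillation_loss(depth_logits, teacher_depth, depth_bins):
    """Training-only distillation sketch (assumed formulation):
    supervise per-token depth-distribution logits with depths from a
    pretrained depth teacher, discretized into soft bin targets.

    depth_logits:  (B, N, D) per-token logits over D depth bins
    teacher_depth: (B, N)    teacher depth per visual token (e.g. pooled
                             from a monocular depth model's dense map)
    depth_bins:    (D,)      centers of the depth discretization
    """
    # Soft target: place probability mass on bins near the teacher depth.
    dist = (teacher_depth.unsqueeze(-1) - depth_bins).abs()   # (B, N, D)
    target = F.softmax(-dist / 0.1, dim=-1)                   # temperature 0.1 is an assumption
    log_prob = F.log_softmax(depth_logits, dim=-1)
    # Cross-entropy between the soft teacher target and the predicted distribution.
    return -(target * log_prob).sum(-1).mean()
```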

On RoboTwin 2.0 under the domain-randomized setting, PEAfowl improves over the strongest baseline by 23.0 percentage points in success rate, and real-robot experiments further demonstrate reliable sim-to-real transfer and consistent improvements from depth distillation.


Figure 1: Motivation and overview of PEAfowl. (a) Prior bimanual VLAs typically concatenate per-view visual tokens and apply global text conditioning, without explicit cross-view geometric alignment or instruction-relevant visual evidence retrieval and aggregation. (b) We propose PEAfowl, which incorporates geometry-guided multi-view fusion and a Perceiver-style text-as-query readout over frozen CLIP features. Bottom: Average success rates on RoboTwin 2.0 (nine training tasks) under Clean and Domain-Randomized settings, comparing multi-task bimanual baselines.

🚀 Method: PEAfowl

We introduce PEAfowl, a multi-view vision-language-action model with geometry- and language-guided perception for bimanual manipulation. On the spatial side, we propose a geometry-driven multi-view fusion module that (i) performs per-patch RGB-D token fusion, (ii) predicts a depth distribution for each visual token, and (iii) backprojects tokens into a shared 3D base frame to perform local 3D neighborhood aggregation across cameras. This design explicitly models cross-view geometric correspondences and endows 2D visual features with depth-aware, 3D-consistent structure.
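To make the geometry-driven fusion concrete, the following PyTorch sketch shows one way the steps could fit together: predict a per-token depth distribution, lift tokens into the shared base frame via the expected depth and camera parameters, and aggregate each token with its cross-view neighbors in 3D. All tensor layouts, the expected-depth lifting, the k-nearest-neighbor choice, and the averaging fusion operator are illustrative assumptions, not the official code.

```python
import torch

def lift_tokens_to_3d(feat, depth_logits, pix_uv, K_inv, cam_to_base, depth_bins):
    """Differentiable 3D lifting sketch (illustrative, not the released code).

    feat:         (B, V, N, C)  per-view visual tokens
    depth_logits: (B, V, N, D)  per-token depth-distribution logits
    pix_uv:       (V, N, 2)     pixel coords of each token's patch center
    K_inv:        (V, 3, 3)     inverse camera intrinsics
    cam_to_base:  (V, 4, 4)     camera-to-base extrinsics
    depth_bins:   (D,)          depth bin centers
    Returns token features and their 3D positions in the shared base frame.
    """
    B, V, N, D = depth_logits.shape
    # Expected depth under the predicted per-token distribution (differentiable).
    prob = depth_logits.softmax(-1)                                   # (B, V, N, D)
    depth = (prob * depth_bins).sum(-1)                               # (B, V, N)

    # Backproject patch centers: x_cam = depth * K^-1 [u, v, 1]^T.
    uv1 = torch.cat([pix_uv, torch.ones_like(pix_uv[..., :1])], -1)   # (V, N, 3)
    rays = torch.einsum('vij,vnj->vni', K_inv, uv1)                   # (V, N, 3)
    xyz_cam = depth.unsqueeze(-1) * rays                              # (B, V, N, 3)

    # Transform every view into the shared 3D base frame.
    R, t = cam_to_base[:, :3, :3], cam_to_base[:, :3, 3]
    xyz = torch.einsum('vij,bvnj->bvni', R, xyz_cam) + t.view(1, V, 1, 3)
    return feat, xyz

def local_3d_aggregation(feat, xyz, k=8):
    """Aggregate each token with its k nearest cross-view neighbors in 3D
    (a simple mean here; the actual fusion operator is an assumption)."""
    B, V, N, C = feat.shape
    f = feat.reshape(B, V * N, C)
    p = xyz.reshape(B, V * N, 3)
    idx = torch.cdist(p, p).topk(k, largest=False).indices            # (B, VN, k)
    neigh = torch.gather(
        f.unsqueeze(1).expand(-1, V * N, -1, -1), 2,
        idx.unsqueeze(-1).expand(-1, -1, -1, C))                      # (B, VN, k, C)
    return (f + neigh.mean(2)).reshape(B, V, N, C)
```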

On the language side, PEAfowl builds upon OTTER-style text-aware visual extraction and replaces global text conditioning with a Perceiver-style text-aware transformer. Text tokens act as latent queries that iteratively cross-attend to per-view patch features, producing a compact set of language-conditioned visual tokens. This mechanism sharpens attention on task-relevant objects and spatial relations, yielding representations well suited to policy learning.
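The text-as-query readout can be approximated by a small stack of cross-attention blocks in which text tokens query frozen CLIP patch features. The sketch below is an illustrative approximation only; the layer count, dimensions, normalization placement, and class name are assumptions rather than the official module.

```python
import torch.nn as nn

class TextAsQueryReadout(nn.Module):
    """Perceiver-style readout sketch: text tokens act as latent queries
    that iteratively cross-attend to frozen per-view CLIP patch features."""

    def __init__(self, dim=512, heads=8, layers=3):
        super().__init__()
        self.blocks = nn.ModuleList([
            nn.ModuleDict({
                "norm_q": nn.LayerNorm(dim),
                "norm_kv": nn.LayerNorm(dim),
                "attn": nn.MultiheadAttention(dim, heads, batch_first=True),
                "ff": nn.Sequential(nn.LayerNorm(dim),
                                    nn.Linear(dim, 4 * dim),
                                    nn.GELU(),
                                    nn.Linear(4 * dim, dim)),
            }) for _ in range(layers)
        ])

    def forward(self, text_tokens, patch_tokens):
        # text_tokens:  (B, T, C)    text token embeddings used as queries
        # patch_tokens: (B, V*P, C)  frozen CLIP patch features, all views concatenated
        q = text_tokens
        for blk in self.blocks:
            # Iterative evidence accumulation: queries refine layer by layer.
            kv = blk["norm_kv"](patch_tokens)
            attn_out, _ = blk["attn"](blk["norm_q"](q), kv, kv)
            q = q + attn_out
            q = q + blk["ff"](q)
        return q  # compact language-conditioned visual tokens for the policy
```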


Figure 2: PEAfowl architecture. PEAfowl couples geometry-guided multi-view fusion with language-guided readout to condition a SEM-style diffusion action decoder. Top: RGB–D tokens are used to predict per-token depth distributions for differentiable 3D lifting and cross-view fusion; a pretrained camera depth model supervises the depth-distribution head during training only. Bottom: Frozen CLIP features are queried by a Perceiver-style text-as-query readout and pooled into compact context tokens.

Note: For more visualizations and real-world robot demos, please visit our Project Page.


📊 Results

Both simulation and physical experiments demonstrate that PEAfowl, despite having only 300M trainable parameters, outperforms existing bimanual VLA models and visuomotor baselines, and generalizes well to novel scene appearances, workspace geometries, and language instructions. In simulation, we evaluate on 9 tasks from RoboTwin 2.0 under both clean and heavily domain-randomized settings. On physical platforms, PEAfowl demonstrates consistent sim-to-real transfer, particularly on tasks requiring precise perception and bimanual coordination.
