R3DP: Real-Time 3D-Aware Policy for Embodied Manipulation

Yuhao Zhang1,2*, Wanxi Dong3*, Yue Shi1*, Yi Liang1*, Jingnan Gao1,
Qiaochu Yang1, Yaxing Lyu6, Zhixuan Liang4, Yibin Liu7, Congsheng Xu1,
Xianda Guo5,2, Wei Sui2, Yaohui Jin1, Xiaokang Yang1, Yanyan Xu1, Yao Mu1†
1Shanghai Jiao Tong University 2D-Robotics 3Southern University of Science and Technology
4The University of Hong Kong 5Wuhan University 6Xiamen University Malaysia 7Northeastern University

*Equal Contribution   †Corresponding Author

R3DP Overview

Key modules and performance of R3DP. Our framework explicitly integrates 3D priors from large-scale foundation models (e.g., VGGT) via an asynchronous fast-slow collaboration mechanism. Overall, R3DP achieves real-time, 3D-aware inference, significantly improving both manipulation success rates and processing frequency.

Abstract

Embodied manipulation requires accurate 3D understanding of objects and their spatial relations to plan and execute contact-rich actions. While large-scale 3D foundation models provide strong priors, their computational cost incurs prohibitive latency for real-time control. We propose Real-time 3D-aware Policy (R3DP), which integrates powerful 3D priors into manipulation policies without sacrificing real-time performance.

A core innovation of R3DP is the asynchronous fast-slow collaboration (AFSC) module, which injects the priors of a large-scale 3D model into the policy without compromising real-time performance. The system queries the pre-trained slow system (VGGT) only on sparse keyframes, while a lightweight Temporal Feature Prediction Network (TFPNet) predicts features for all intermediate frames. Additionally, we introduce a Multi-View Feature Fuser (MVFF) that aggregates features across views by explicitly incorporating camera intrinsics and extrinsics.

R3DP offers a plug-and-play solution for integrating 3D foundation models into real-time inference systems. It effectively harnesses large-scale 3D priors to achieve superior results, outperforming single-view and multi-view DP by 32.9 and 51.4 percentage points in average success rate. Furthermore, by decoupling heavy 3D understanding from policy execution, R3DP reduces inference time by 44.8% compared to a naive DP+VGGT integration.

Method

R3DP Architecture

Overview of the R3DP architecture. R3DP serves as a 3D-aware perception module that seamlessly replaces visual encoders in existing imitation learning frameworks. Within the AFSC module, sparse keyframes are processed by a 3D foundation model (VGGT), while intermediate frames are handled by our TFPNet for real-time temporal reasoning. The MVFF module leverages cross-attention with PRoPE to fuse 2D-3D features into consistent multi-view representations for control.

Asynchronous Fast-Slow Collaboration (AFSC)

AFSC treats a large 3D vision foundation model as the slow system: once every τ inference steps, it processes the current multi-view RGB images to compute high-fidelity Slow 3D-aware Features (S3DF). Meanwhile, a lightweight fast system (TFPNet) propagates these features to every intermediate frame, using historical context to predict Real-time 3D-aware Features (R3DF).
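The keyframe scheduling behind AFSC can be sketched as follows. This is a minimal illustration: `slow_model` and `fast_model` are hypothetical stand-ins for VGGT and TFPNet (their signatures are not from the R3DP release), and the real system runs the slow path asynchronously rather than blocking on it as this loop does.

```python
from collections import deque


class AFSCScheduler:
    """Minimal sketch of the asynchronous fast-slow collaboration loop."""

    def __init__(self, slow_model, fast_model, tau=4, history=3):
        self.slow_model = slow_model   # heavy 3D encoder, queried sparsely
        self.fast_model = fast_model   # lightweight temporal predictor
        self.tau = tau                 # slow-system period (keyframe interval)
        self.history = deque(maxlen=history)  # recent 3D-aware features
        self.step = 0

    def encode(self, multi_view_rgb):
        if self.step % self.tau == 0:
            # Keyframe: compute high-fidelity Slow 3D-aware Features (S3DF).
            feats = self.slow_model(multi_view_rgb)
        else:
            # Intermediate frame: predict Real-time 3D-aware Features (R3DF)
            # from the current observation plus historical context.
            feats = self.fast_model(multi_view_rgb, list(self.history))
        self.history.append(feats)
        self.step += 1
        return feats
```

With τ=4, the heavy model runs on frames 0, 4, 8, … and the fast predictor covers the three frames in between, which is what amortizes the slow model's latency.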

Temporal Feature Prediction Network (TFPNet)

TFPNet is a lightweight network that leverages historical-frame 3D-aware features to predict current-frame features in real-time. It uses a DINOv2-S backbone with 4 Alternating-Attention Transformer blocks, followed by cross-attention to inject temporal priors from previous frames.
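The temporal-prior injection at the heart of TFPNet is ordinary cross-attention from current-frame tokens to historical-frame features. Below is a minimal single-head numpy sketch of that mechanism; the weight names and shapes are illustrative, not taken from the paper's implementation, which uses a DINOv2-S backbone and Alternating-Attention Transformer blocks before this step.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def temporal_cross_attention(curr_tokens, hist_tokens, Wq, Wk, Wv):
    """Queries come from current-frame tokens; keys/values come from
    historical-frame features, so past 3D-aware context is injected
    into the current prediction via a residual update."""
    q = curr_tokens @ Wq                      # (n_curr, d)
    k = hist_tokens @ Wk                      # (n_hist, d)
    v = hist_tokens @ Wv                      # (n_hist, d)
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)
    return curr_tokens + attn @ v             # residual injection
```

Note the residual form: if the historical features carry no signal, the current-frame tokens pass through unchanged.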

Multi-View Feature Fuser (MVFF)

MVFF explicitly leverages camera intrinsics and extrinsics for geometry-aware cross-view feature alignment. Using Projective Positional Encoding (PRoPE), it achieves consistent 3D-aware representation across viewpoints while respecting camera geometry.
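One concrete way to expose intrinsics and extrinsics to a cross-view attention layer is to attach per-patch camera rays as positional features. The sketch below is an illustrative ray-map encoding in this spirit; it is not the exact PRoPE formulation used by MVFF.

```python
import numpy as np


def camera_ray_map(K, cam_to_world, h, w):
    """Per-pixel (or per-patch) ray origins and unit directions in world
    coordinates, derived from a 3x3 intrinsics matrix K and a 4x4
    camera-to-world extrinsics matrix. Concatenating these with image
    features gives attention a geometry-aware positional signal."""
    ys, xs = np.meshgrid(np.arange(h) + 0.5, np.arange(w) + 0.5, indexing="ij")
    pix = np.stack([xs, ys, np.ones_like(xs)], axis=-1)   # (h, w, 3) homogeneous
    dirs_cam = pix @ np.linalg.inv(K).T                   # back-project to camera frame
    dirs_world = dirs_cam @ cam_to_world[:3, :3].T        # rotate into world frame
    dirs_world = dirs_world / np.linalg.norm(dirs_world, axis=-1, keepdims=True)
    origins = np.broadcast_to(cam_to_world[:3, 3], dirs_world.shape)
    return origins, dirs_world
```

Because the encoding is computed from each view's own K and extrinsics, tokens from different cameras carry mutually consistent world-frame geometry, which is the property MVFF relies on for cross-view alignment.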

Performance

R3DP achieves a 32.9-percentage-point higher average success rate than single-view DP and a 51.4-point higher rate than multi-view DP on simulation benchmarks. In real-world experiments, it exceeds the strongest baseline (DP+VGGT+MVFF) by 5.0 percentage points in average success rate. Furthermore, R3DP reduces observation-encoding latency by 44.8% compared to naive DP+VGGT integration, enabling real-time control.

TFPNet Architecture

Architecture and training objective of TFPNet. For clarity, we show the unrolled structure for the first two timesteps; in practice, the network is trained over a sequence of four timesteps. TFPNet leverages historical information to augment current observations, enabling 3D-aware control with real-time inference efficiency.

Experiments

Simulation Results on RoboTwin Benchmark

| Task | DP-single | DP-multi | DP3 | π0 | R3DP (τ=4) | R3DP (τ=8) |
|------|-----------|----------|-----|-----|------------|------------|
| Block Hammer Beat | 0% | 0% | 49% | 47% | 77% | 77% |
| Block Handover | 1% | 2% | 48% | 71% | 95% | 93% |
| Blocks Stack Easy | 6% | 7% | 26% | 79% | 69% | 62% |
| Shoe Place | 45% | 19% | 49% | 71% | 72% | 68% |
| Put Apple Cabinet | 92% | 38% | 98% | 64% | 100% | 98% |
| Tube Insert | 92% | 64% | 97% | 68% | 97% | 97% |
| Average | 36.1% | 17.6% | 57.6% | 59.9% | 69.0% | 65.7% |

Real-World Results

| Task | DP | DP3 | DP+VGGT+MVFF | R3DP |
|------|-----|-----|--------------|------|
| Place Shoe | 46.7% | 50.0% | 76.7% | 86.7% |
| Place Glass Cup | 23.3% | 56.7% | 76.7% | 83.3% |
| Pick Peach | 20.0% | 30.0% | 46.7% | 50.0% |
| Stack Bowls | 33.3% | 56.7% | 66.7% | 66.7% |
| Average | 30.8% | 48.4% | 66.7% | 71.7% |

Inference Latency Comparison

| Latency (ms) | DP+VGGT | DP+VGGT+MVFF | R3DP (τ=4) | R3DP (τ=8) |
|--------------|---------|--------------|------------|------------|
| Obs. Encoder | 73.1 | 78.3 | 50.5 (↓30.9%) | 40.3 (↓44.8%) |
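As a first-order model, running the slow encoder once every τ frames and the fast predictor on the remaining τ−1 frames gives an amortized per-frame cost of (T_slow + (τ−1)·T_fast)/τ. This is an idealized accounting only; the measured numbers above include asynchronous overlap and other pipeline effects, so they need not fit the formula exactly. A quick sketch:

```python
def amortized_latency(t_slow, t_fast, tau):
    """Average per-frame observation-encoding latency when the slow system
    runs once every tau frames and the fast system covers the rest
    (idealized: ignores asynchronous overlap and scheduling overhead)."""
    return (t_slow + (tau - 1) * t_fast) / tau


# Hypothetical timings (ms), chosen for illustration only:
print(amortized_latency(80.0, 30.0, 4))   # slow path dominates less as tau grows
print(amortized_latency(80.0, 30.0, 8))
```

The formula makes the design trade-off explicit: a larger τ lowers amortized latency but lets the fast system's predictions drift further from the last high-fidelity keyframe.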

Depth Map Visualization

Simulation (Put Apple Cabinet)

[Head view: RGB | VGGT depth | Our depth]
[Front view: RGB | VGGT depth | Our depth]

Real-World (Stack Bowls)

[Head view: RGB | VGGT depth | Our depth]
[Front view: RGB | VGGT depth | Our depth]

Visualization of depth maps decoded from VGGT features and from our TFPNet-predicted features passed through VGGT's depth decoder. The close visual agreement indicates that our lightweight TFPNet effectively captures the information produced by the 3D foundation model in both simulation and real-world settings.

Real-World Experiments

[Place Shoe: platform | point cloud]
[Place Glass Cup: platform | point cloud]
[Pick Peach: platform | point cloud]
[Stack Bowls: platform | point cloud]

Real-world experimental platforms and point cloud inputs. We evaluate our method on the ArmBot-Y1 bimanual robot for tasks including Place Shoe and Place Glass Cup, and a single-arm robot for Pick Peach and Stack Bowls. Both platforms are equipped with dual RealSense D435 cameras to generate the ground-truth point clouds used by the DP3 baseline.

Citation

If you find our work useful, please consider citing:

@article{r3dp2025,
  title={R3DP: Real-Time 3D-Aware Policy for Embodied Manipulation},
  author={Zhang, Yuhao and Dong, Wanxi and Shi, Yue and Liang, Yi and Gao, Jingnan and Yang, Qiaochu and Lyu, Yaxing and Liang, Zhixuan and Liu, Yibin and Xu, Congsheng and Guo, Xianda and Sui, Wei and Jin, Yaohui and Yang, Xiaokang and Xu, Yanyan and Mu, Yao},
  journal={arXiv preprint},
  year={2025}
}