Embodied manipulation requires accurate 3D understanding of objects and their spatial relations to plan and execute contact-rich actions. While large-scale 3D foundation models provide strong priors, their computational cost incurs prohibitive latency for real-time control. We propose Real-time 3D-aware Policy (R3DP), which integrates powerful 3D priors into manipulation policies without sacrificing real-time performance.
At the core of R3DP is the Asynchronous Fast-Slow Collaboration (AFSC) module, which integrates the priors of a large-scale 3D model into the policy without compromising real-time performance. The system queries the pre-trained slow system (VGGT) only on sparse keyframes, while a lightweight Temporal Feature Prediction Network (TFPNet) predicts features for all intermediate frames. Additionally, we introduce a Multi-View Feature Fuser (MVFF) that aggregates features across views by explicitly incorporating camera intrinsics and extrinsics.
R3DP offers a plug-and-play solution for integrating 3D foundation models into real-time inference systems. It effectively harnesses large-scale 3D priors to achieve superior results, outperforming single-view and multi-view Diffusion Policy (DP) by 32.9% and 51.4% in average success rate. Furthermore, by decoupling heavy 3D understanding from policy execution, R3DP achieves a 44.8% reduction in inference time compared to a naive DP+VGGT integration.
Overview of the R3DP architecture. R3DP serves as a 3D-aware perception module that seamlessly replaces the visual encoders in existing imitation learning frameworks. Within the AFSC module, sparse keyframes are processed by a 3D foundation model (VGGT), while intermediate frames are handled by our TFPNet for real-time temporal reasoning. The MVFF module leverages cross-attention with Projective Positional Encoding (PRoPE) to fuse 2D-3D features into consistent multi-view representations for control.
AFSC treats a large 3D vision foundation model as the slow system: once every τ policy inferences, it processes the current multi-view RGB images to compute high-fidelity Slow 3D-aware Features (S3DF). Meanwhile, a lightweight fast system (TFPNet) propagates these features to every intermediate frame, using historical context to predict Real-time 3D-aware Features (R3DF).
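The fast-slow scheduling logic above can be sketched in a few lines of Python. This is a hypothetical skeleton, not the actual R3DP code: `FastSlowScheduler`, `encode`, and both model callables are illustrative names.

```python
class FastSlowScheduler:
    """Minimal sketch of the AFSC fast-slow loop (hypothetical API).

    Every `tau` policy steps, the heavy 3D model (the "slow system",
    e.g. VGGT) recomputes Slow 3D-aware Features (S3DF); in between, a
    lightweight predictor (the "fast system", e.g. TFPNet) propagates
    them as Real-time 3D-aware Features (R3DF).
    """

    def __init__(self, slow_model, fast_model, tau=4, history_len=4):
        self.slow_model = slow_model    # e.g. a frozen VGGT encoder
        self.fast_model = fast_model    # e.g. TFPNet
        self.tau = tau
        self.history_len = history_len
        self.history = []               # cached features: temporal context for the fast system
        self.step = 0

    def encode(self, rgb):
        if self.step % self.tau == 0:
            # Keyframe: run the slow system for high-fidelity S3DF.
            feats = self.slow_model(rgb)
        else:
            # Intermediate frame: predict R3DF from history in real time.
            feats = self.fast_model(rgb, self.history)
        self.history = (self.history + [feats])[-self.history_len:]
        self.step += 1
        return feats
```

With τ = 4, only one in four observations pays the cost of the foundation model, which is where the latency reduction comes from.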
TFPNet is a lightweight network that leverages historical-frame 3D-aware features to predict current-frame features in real time. It uses a DINOv2-S backbone with 4 Alternating-Attention Transformer blocks, followed by cross-attention to inject temporal priors from previous frames.
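The temporal-prior injection step can be illustrated with a bare single-head cross-attention in NumPy. This is a toy sketch under stated assumptions (the real TFPNet uses a DINOv2-S backbone and Alternating-Attention blocks; `inject_temporal_prior` and its residual form are illustrative, not the paper's exact design):

```python
import numpy as np

def cross_attention(query, keys, values):
    """Single-head scaled dot-product cross-attention.
    query: (Nq, d); keys, values: (Nk, d)."""
    d = query.shape[-1]
    scores = query @ keys.T / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ values

def inject_temporal_prior(current_feats, history_feats):
    """Fuse historical-frame features into current-frame tokens (sketch):
    current tokens query the concatenated history, residual-added back."""
    hist = np.concatenate(history_feats, axis=0)   # (sum Nk, d)
    return current_feats + cross_attention(current_feats, hist, hist)
```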
MVFF explicitly leverages camera intrinsics and extrinsics for geometry-aware cross-view feature alignment. Using Projective Positional Encoding (PRoPE), it achieves consistent 3D-aware representation across viewpoints while respecting camera geometry.
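To make concrete how camera intrinsics and extrinsics can enter a positional encoding, the toy sketch below computes per-pixel world-frame ray directions from K and a camera-to-world transform. This is not the actual PRoPE formulation, only an illustration of geometry-aware per-token features; `ray_directions` is a hypothetical helper.

```python
import numpy as np

def ray_directions(K, cam_to_world, H, W):
    """Per-pixel world-frame ray directions from camera intrinsics K (3x3)
    and extrinsics cam_to_world (4x4). A toy stand-in for geometry-aware
    positional features, not the actual PRoPE encoding."""
    # Pixel centers in homogeneous image coordinates.
    u, v = np.meshgrid(np.arange(W) + 0.5, np.arange(H) + 0.5)
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)     # (H, W, 3)
    # Unproject to camera frame, then rotate into the world frame.
    dirs_cam = pix @ np.linalg.inv(K).T
    dirs_world = dirs_cam @ cam_to_world[:3, :3].T
    return dirs_world / np.linalg.norm(dirs_world, axis=-1, keepdims=True)
```

Because the rays live in a shared world frame, tokens from different cameras carry mutually consistent geometric context, which is the property MVFF's cross-view attention relies on.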
R3DP achieves a 32.9% higher success rate than single-view DP and 51.4% higher than multi-view DP on simulation benchmarks. In real-world experiments, it outperforms the strongest baseline (DP+VGGT+MVFF) by 5.0 points in average success rate. Furthermore, R3DP reduces observation encoding latency by 44.8% compared to naive DP+VGGT integration, enabling real-time control.
Architecture and training objective of TFPNet. For clarity, we show the unrolled structure for the first two timesteps; in practice, the network is trained over a sequence of four timesteps. TFPNet leverages historical information to augment current observations, enabling 3D-aware control with real-time inference efficiency.
| Task | DP-single | DP-multi | DP3 | π0 | R3DP (τ=4) | R3DP (τ=8) |
|---|---|---|---|---|---|---|
| Block Hammer Beat | 0% | 0% | 49% | 47% | 77% | 77% |
| Block Handover | 1% | 2% | 48% | 71% | 95% | 93% |
| Blocks Stack Easy | 6% | 7% | 26% | 79% | 69% | 62% |
| Shoe Place | 45% | 19% | 49% | 71% | 72% | 68% |
| Put Apple Cabinet | 92% | 38% | 98% | 64% | 100% | 98% |
| Tube Insert | 92% | 64% | 97% | 68% | 97% | 97% |
| Average | 36.1% | 17.6% | 57.6% | 59.9% | 69.0% | 65.7% |
| Task | DP | DP3 | DP+VGGT+MVFF | R3DP |
|---|---|---|---|---|
| Place Shoe | 46.7% | 50.0% | 76.7% | 86.7% |
| Place Glass Cup | 23.3% | 56.7% | 76.7% | 83.3% |
| Pick Peach | 20.0% | 30.0% | 46.7% | 50.0% |
| Stack Bowls | 33.3% | 56.7% | 66.7% | 66.7% |
| Average | 30.8% | 48.4% | 66.7% | 71.7% |
| Latency (ms) | DP+VGGT | DP+VGGT+MVFF | R3DP (τ=4) | R3DP (τ=8) |
|---|---|---|---|---|
| Obs. Encoder | 73.1 | 78.3 | 50.5 (↓30.9%) | 40.3 (↓44.8%) |
Panels (four example groups, left to right in each): RGB input · VGGT · Ours.
Visualization of depth maps decoded from VGGT features and from our TFPNet-predicted features passed through VGGT's depth decoder. The close visual agreement indicates that our lightweight TFPNet effectively captures the information generated by the 3D foundation model in both simulation and real-world experiments.
Panels: Place Shoe · Place Glass Cup · Pick Peach · Stack Bowls.
Real-world experimental platforms and point cloud inputs. We evaluate our method on the ArmBot-Y1 bimanual robot for tasks including Place Shoe and Place Glass Cup, and a single-arm robot for Pick Peach and Stack Bowls. Both platforms are equipped with dual RealSense D435 cameras to generate the ground-truth point clouds used by the DP3 baseline.
If you find our work useful, please consider citing:
@article{r3dp2025,
title={R3DP: Real-Time 3D-Aware Policy for Embodied Manipulation},
author={Zhang, Yuhao and Dong, Wanxi and Shi, Yue and Liang, Yi and Gao, Jingnan and Yang, Qiaochu and Lyu, Yaxing and Liang, Zhixuan and Liu, Yibin and Xu, Congsheng and Guo, Xianda and Sui, Wei and Jin, Yaohui and Yang, Xiaokang and Xu, Yanyan and Mu, Yao},
journal={arXiv preprint},
year={2025}
}