Abstract
Depth sensing is an important problem for 3D vision-based robotics. Yet, real-world active stereo and ToF depth cameras often produce noisy and incomplete depth maps that bottleneck robot performance. In this work, we propose D3RoMa, a learning-based depth estimation framework on stereo image pairs that predicts clean and accurate depth in diverse indoor scenes, even in the most challenging scenarios with translucent or specular surfaces where classical depth sensing completely fails. Key to our method is that we unify depth estimation and restoration into an image-to-image translation problem by predicting the disparity map with a denoising diffusion probabilistic model. At inference time, we further incorporate a left-right consistency constraint as classifier guidance for the diffusion process. Our framework combines recent advances in learning-based approaches with geometric constraints from traditional stereo vision. For model training, we create a large scene-level synthetic dataset with diverse transparent and specular objects to complement existing tabletop datasets. The trained model can be directly applied to real-world in-the-wild scenes and achieves state-of-the-art performance on multiple public depth estimation benchmarks. Further experiments in real environments show that accurate depth prediction significantly improves robotic manipulation in various scenarios.
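The left-right consistency guidance mentioned above can be sketched in a few lines. The snippet below is a toy illustration, not the paper's implementation: it assumes a 1-D scanline warping model with linear interpolation, a photometric L2 consistency loss, and a single DDPM reverse step whose mean is shifted by the loss gradient, in the spirit of classifier guidance. All function names and the noise schedule here are hypothetical simplifications.

```python
import numpy as np

def warp_right_to_left(right, disp):
    """Warp the right image into the left view with per-pixel disparity,
    using 1-D linear interpolation along scanlines (toy stereo model)."""
    h, w = right.shape
    xs = np.arange(w)[None, :] - disp            # sampling positions x - d
    x0 = np.floor(xs).astype(int)
    frac = xs - x0
    x0c = np.clip(x0, 0, w - 1)
    x1c = np.clip(x0 + 1, 0, w - 1)
    r0 = np.take_along_axis(right, x0c, axis=1)
    r1 = np.take_along_axis(right, x1c, axis=1)
    return (1.0 - frac) * r0 + frac * r1

def lr_consistency_grad(disp, left, right):
    """Analytic gradient of the photometric consistency loss
    0.5 * ||left - warp(right, disp)||^2 with respect to the disparity."""
    h, w = right.shape
    xs = np.arange(w)[None, :] - disp
    x0 = np.floor(xs).astype(int)
    x0c = np.clip(x0, 0, w - 1)
    x1c = np.clip(x0 + 1, 0, w - 1)
    r0 = np.take_along_axis(right, x0c, axis=1)
    r1 = np.take_along_axis(right, x1c, axis=1)
    warped = warp_right_to_left(right, disp)
    # d(warp)/d(disp) = -(r1 - r0) because the sample position is x - disp
    return (left - warped) * (r1 - r0)

def guided_reverse_step(x_t, t, eps_model, alphas, alphas_bar,
                        left, right, scale=0.1, rng=None):
    """One DDPM reverse step on a disparity map x_t, with the consistency
    gradient subtracted from the posterior mean as classifier guidance."""
    rng = rng or np.random.default_rng(0)
    eps = eps_model(x_t, t)
    a_t, ab_t = alphas[t], alphas_bar[t]
    mean = (x_t - (1.0 - a_t) / np.sqrt(1.0 - ab_t) * eps) / np.sqrt(a_t)
    mean = mean - scale * lr_consistency_grad(x_t, left, right)
    if t > 0:  # no noise is added on the final step
        mean = mean + np.sqrt(1.0 - a_t) * rng.standard_normal(x_t.shape)
    return mean
```

At the true disparity the warped right image matches the left image, so the guidance gradient vanishes and the diffusion mean is left untouched; away from it, the photometric residual pushes the sample toward geometrically consistent disparities.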
URL
https://arxiv.org/abs/2409.14365