DERD-Net: Learning Depth from Event-based Ray Densities

2025-04-22 12:58:05
Diego de Oliveira Hitzges, Suman Ghosh, Guillermo Gallego

Abstract

Event cameras offer a promising avenue for multi-view stereo depth estimation and Simultaneous Localization And Mapping (SLAM) due to their ability to detect blur-free 3D edges at high speed and over broad illumination conditions. However, traditional deep learning frameworks designed for conventional cameras struggle with the asynchronous, stream-like nature of event data, as their architectures are optimized for discrete, image-like inputs. We propose a scalable, flexible and adaptable framework for pixel-wise depth estimation with event cameras in both monocular and stereo setups. The 3D scene structure is encoded into disparity space images (DSIs), representing spatial densities of rays obtained by back-projecting events into space via known camera poses. Our neural network processes local subregions of the DSIs, combining 3D convolutions and a recurrent structure to recognize valuable patterns for depth prediction. Local processing enables fast inference with full parallelization and ensures constant ultra-low model complexity and memory costs, regardless of camera resolution. Experiments on standard benchmarks (MVSEC and DSEC datasets) demonstrate unprecedented effectiveness: (i) using purely monocular data, our method achieves results comparable to existing stereo methods; (ii) when applied to stereo data, it strongly outperforms all state-of-the-art (SOTA) approaches, reducing the mean absolute error by at least 42%; (iii) our method also allows depth completeness to increase more than 3-fold while still reducing the median absolute error by at least 30%. Given its remarkable performance and effective processing of event data, our framework holds strong potential to become a standard approach for using deep learning for event-based depth estimation and SLAM. Project page: this https URL
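To make the described pipeline concrete, below is a minimal PyTorch sketch of the core idea: a small network takes a local subregion of a DSI (a grid of ray densities over depth planes), applies 3D convolutions, and runs a recurrent pass along the depth axis to produce a per-pixel distribution over depth planes. All layer sizes, the GRU choice, the patch dimensions, and the random toy input are illustrative assumptions, not the paper's actual DERD-Net architecture.

```python
import torch
import torch.nn as nn

# Hypothetical local subregion: D depth planes, H x W patch around the target pixel.
D, H, W = 32, 7, 7

class LocalDSINet(nn.Module):
    """Sketch only: 3D convolutions over a local DSI subregion, then a
    recurrent pass along the depth axis, ending in a distribution over
    depth planes for the patch's center pixel."""
    def __init__(self, channels=16, hidden=32):
        super().__init__()
        self.conv3d = nn.Sequential(
            nn.Conv3d(1, channels, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv3d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(),
        )
        # Treat the D depth planes as a sequence: each recurrent step sees
        # the spatially flattened features of one depth slice.
        self.gru = nn.GRU(input_size=channels * H * W,
                          hidden_size=hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)  # one score per depth plane

    def forward(self, dsi):                          # dsi: (B, 1, D, H, W)
        f = self.conv3d(dsi)                         # (B, C, D, H, W)
        seq = f.permute(0, 2, 1, 3, 4).flatten(2)    # (B, D, C*H*W)
        out, _ = self.gru(seq)                       # (B, D, hidden)
        scores = self.head(out).squeeze(-1)          # (B, D)
        return scores.softmax(dim=-1)                # distribution over depth planes

# Toy stand-in for a real DSI, whose cells would hold ray counts accumulated
# by back-projecting events through known camera poses.
dsi = torch.rand(1, 1, D, H, W)
probs = LocalDSINet()(dsi)
depth_plane = probs.argmax(dim=-1)  # most likely depth plane for the center pixel
```

Because each patch is processed independently, the model size is fixed by the patch and depth-plane dimensions rather than the sensor resolution, and all patches can be batched and evaluated in parallel, which matches the constant-complexity, fully parallel inference claimed in the abstract.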

URL

https://arxiv.org/abs/2504.15863

PDF

https://arxiv.org/pdf/2504.15863.pdf

