Paper Reading AI Learner

AutoDepthNet: High Frame Rate Depth Map Reconstruction using Commodity Depth and RGB Cameras

2023-05-24 05:09:43
Peyman Gholami, Robert Xiao

Abstract

Depth cameras have found applications in diverse fields, such as computer vision, artificial intelligence, and video gaming. However, the high latency and low frame rate of existing commodity depth cameras impose limitations on their applications. We propose a fast and accurate depth map reconstruction technique to reduce latency and increase the frame rate of depth cameras. Our approach uses only a commodity depth camera and a color camera in a hybrid camera setup; our prototype is implemented using a Kinect Azure depth camera running at 30 fps and a high-speed RGB camera (iPhone 11 Pro) capturing at 240 fps. The proposed network, AutoDepthNet, is an encoder-decoder model that takes frames from the high-speed RGB camera and combines them with previous depth frames to reconstruct a stream of high frame rate depth maps. On GPU, at a 480 x 270 output resolution, our system achieves an inference time of 8 ms, enabling real-time use at up to 200 fps with parallel processing. AutoDepthNet can estimate depth values with an average RMS error of 0.076, a 44.5% improvement over an optical flow-based comparison method. Our method can also improve depth map quality by estimating depth values for missing and invalidated pixels. The proposed method can be easily applied to existing depth cameras and facilitates the use of depth cameras in applications that require high-speed depth estimation. We also showcase the effectiveness of the framework in upsampling different sparse datasets, e.g., video object segmentation. As a demonstration of our method, we integrated our framework into existing body tracking systems and showed its robustness in such applications.
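The core idea of the hybrid setup is that many high-speed RGB frames arrive between consecutive depth frames, so each RGB frame must be paired with the most recent available depth frame before being fed to the network. The sketch below illustrates this pairing arithmetic for the 30 fps depth / 240 fps RGB configuration described in the abstract; the function names are hypothetical and the actual AutoDepthNet pipeline details are not specified here.

```python
import math

def latest_depth_index(rgb_t, depth_fps=30.0):
    """Index of the most recent depth frame captured at or before time rgb_t (seconds).

    Assumes both streams start at t=0 and are hardware-synchronized,
    which is a simplifying assumption for illustration.
    """
    return math.floor(rgb_t * depth_fps)

def pair_frames(n_rgb, rgb_fps=240.0, depth_fps=30.0):
    """For each high-speed RGB frame, return the index of the depth frame
    it would be paired with as network input (RGB frame + previous depth)."""
    return [latest_depth_index(i / rgb_fps, depth_fps) for i in range(n_rgb)]

# At 240 fps RGB and 30 fps depth, eight RGB frames share each depth frame:
print(pair_frames(9))  # [0, 0, 0, 0, 0, 0, 0, 0, 1]
```

With this ratio, the reconstruction network must fill in up to seven intermediate depth maps per captured depth frame, which is what allows the system to output depth at a much higher rate than the depth sensor alone.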


URL

https://arxiv.org/abs/2305.14731

PDF

https://arxiv.org/pdf/2305.14731.pdf

