Learning Optical Flow and Scene Flow with Bidirectional Camera-LiDAR Fusion

2023-03-21 16:54:01
Haisong Liu, Tao Lu, Yihui Xu, Jia Liu, Limin Wang

Abstract

In this paper, we study the problem of jointly estimating optical flow and scene flow from synchronized 2D and 3D data. Previous methods either employ a complex pipeline that splits the joint task into independent stages, or fuse 2D and 3D information in an "early-fusion" or "late-fusion" manner. Such one-size-fits-all approaches face a dilemma: they fail either to fully utilize the characteristics of each modality or to maximize the inter-modality complementarity. To address the problem, we propose a novel end-to-end framework, which consists of 2D and 3D branches with multiple bidirectional fusion connections between them in specific layers. Different from previous work, we apply a point-based 3D branch to extract the LiDAR features, as it preserves the geometric structure of point clouds. To fuse dense image features and sparse point features, we propose a learnable operator named bidirectional camera-LiDAR fusion module (Bi-CLFM). We instantiate two types of the bidirectional fusion pipeline, one based on the pyramidal coarse-to-fine architecture (dubbed CamLiPWC), and the other based on the recurrent all-pairs field transforms (dubbed CamLiRAFT). On FlyingThings3D, both CamLiPWC and CamLiRAFT surpass all existing methods and achieve up to a 47.9% reduction in 3D end-point-error relative to the best published result. Our best-performing model, CamLiRAFT, achieves an error of 4.26% on the KITTI Scene Flow benchmark, ranking 1st among all submissions with far fewer parameters. In addition, our methods generalize well and can handle non-rigid motion. Code is available at this https URL.
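
To make the idea of fusing dense image features with sparse point features more concrete, here is a minimal NumPy sketch of one bidirectional fusion step. It is not the authors' Bi-CLFM: the abstract describes that module as learnable, whereas this sketch uses fixed bilinear sampling, scatter-averaging, and plain concatenation. The pinhole projection, the function names (`project_points`, `image_to_points`, `points_to_image`, `fuse_bidirectional`), and the intrinsics parameters `fx`, `fy`, `cx`, `cy` are all assumptions introduced for illustration.

```python
import numpy as np


def project_points(points, fx, fy, cx, cy):
    """Pinhole projection (assumed camera model) of (N, 3) points to (N, 2) pixels (u, v)."""
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    return np.stack([fx * x / z + cx, fy * y / z + cy], axis=1)


def image_to_points(feat_2d, uv):
    """2D -> 3D: bilinearly sample a dense (H, W, C) map at (N, 2) pixel coordinates."""
    h, w, _ = feat_2d.shape
    u = np.clip(uv[:, 0], 0, w - 1)
    v = np.clip(uv[:, 1], 0, h - 1)
    u0 = np.floor(u).astype(int)
    v0 = np.floor(v).astype(int)
    u1 = np.minimum(u0 + 1, w - 1)
    v1 = np.minimum(v0 + 1, h - 1)
    au = (u - u0)[:, None]  # horizontal interpolation weight
    av = (v - v0)[:, None]  # vertical interpolation weight
    top = (1 - au) * feat_2d[v0, u0] + au * feat_2d[v0, u1]
    bottom = (1 - au) * feat_2d[v1, u0] + au * feat_2d[v1, u1]
    return (1 - av) * top + av * bottom


def points_to_image(feat_3d, uv, h, w):
    """3D -> 2D: scatter sparse (N, C) point features onto an (H, W, C) grid.

    Points that land on the same pixel are averaged; pixels with no point stay zero.
    """
    c = feat_3d.shape[1]
    cols = np.clip(np.rint(uv[:, 0]).astype(int), 0, w - 1)
    rows = np.clip(np.rint(uv[:, 1]).astype(int), 0, h - 1)
    flat = rows * w + cols
    summed = np.zeros((h * w, c))
    count = np.zeros(h * w)
    np.add.at(summed, flat, feat_3d)  # unbuffered add handles repeated indices
    np.add.at(count, flat, 1.0)
    mean = summed / np.maximum(count, 1.0)[:, None]
    return mean.reshape(h, w, c)


def fuse_bidirectional(feat_2d, feat_3d, uv):
    """Exchange features in both directions and concatenate (a stand-in for learned fusion)."""
    h, w, _ = feat_2d.shape
    fused_3d = np.concatenate([feat_3d, image_to_points(feat_2d, uv)], axis=1)
    fused_2d = np.concatenate([feat_2d, points_to_image(feat_3d, uv, h, w)], axis=2)
    return fused_2d, fused_3d
```

In a pipeline like the one the abstract describes, a step of this kind would be repeated at several layers, so that the point branch sees dense appearance cues while the image branch sees sparse geometric cues.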

URL

https://arxiv.org/abs/2303.12017

PDF

https://arxiv.org/pdf/2303.12017.pdf

