Abstract
In this paper, we study the problem of jointly estimating optical flow and scene flow from synchronized 2D and 3D data. Previous methods either employ a complex pipeline that splits the joint task into independent stages, or fuse 2D and 3D information in an "early-fusion" or "late-fusion" manner. Such one-size-fits-all approaches face a dilemma: they fail either to fully exploit the characteristics of each modality or to maximize the inter-modality complementarity. To address this problem, we propose a novel end-to-end framework consisting of 2D and 3D branches with multiple bidirectional fusion connections between them at specific layers. Different from previous work, we apply a point-based 3D branch to extract LiDAR features, as it preserves the geometric structure of point clouds. To fuse dense image features and sparse point features, we propose a learnable operator named the bidirectional camera-LiDAR fusion module (Bi-CLFM). We instantiate two variants of the bidirectional fusion pipeline, one based on the pyramidal coarse-to-fine architecture (dubbed CamLiPWC), and the other based on the recurrent all-pairs field transforms (dubbed CamLiRAFT). On FlyingThings3D, both CamLiPWC and CamLiRAFT surpass all existing methods and achieve up to a 47.9% reduction in 3D end-point error over the best published result. Our best-performing model, CamLiRAFT, achieves an error of 4.26% on the KITTI Scene Flow benchmark, ranking 1st among all submissions with far fewer parameters. Moreover, our methods generalize well and can handle non-rigid motion. Code is available at this https URL.
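The abstract does not spell out the internals of Bi-CLFM, but a minimal sketch helps illustrate the core idea of bidirectionally fusing dense image features with sparse point features. Everything below is an illustrative assumption, not the paper's actual design: the class name `BidirectionalFusionSketch`, the use of bilinear `grid_sample` for the camera-to-LiDAR direction, and nearest-pixel scattering for the LiDAR-to-camera direction stand in for whatever learnable interpolation the real module uses.

```python
# A hedged sketch of bidirectional camera-LiDAR feature fusion.
# The real Bi-CLFM (gating, learned interpolation, etc.) is not
# described in the abstract; this only shows the two fusion directions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class BidirectionalFusionSketch(nn.Module):
    """Fuse a dense image feature map with sparse point features, both ways."""

    def __init__(self, img_channels: int, pt_channels: int):
        super().__init__()
        # Mix the sampled cross-modal features back into each branch.
        self.to_img = nn.Conv2d(img_channels + pt_channels, img_channels, 1)
        self.to_pts = nn.Linear(pt_channels + img_channels, pt_channels)

    def forward(self, img_feat, pt_feat, uv):
        # img_feat: (B, C_img, H, W) dense camera features
        # pt_feat:  (B, N, C_pt)     sparse LiDAR point features
        # uv:       (B, N, 2)        point projections, normalized to [-1, 1]
        B, _, H, W = img_feat.shape
        c_pt = pt_feat.shape[-1]

        # Camera -> LiDAR: bilinearly sample image features at each point.
        sampled = F.grid_sample(
            img_feat, uv.unsqueeze(2), align_corners=True
        ).squeeze(-1).transpose(1, 2)                  # (B, N, C_img)
        pt_out = self.to_pts(torch.cat([pt_feat, sampled], dim=-1))

        # LiDAR -> camera: splat point features onto the image grid
        # (nearest-pixel scatter; the real module likely learns this step).
        x = ((uv[..., 0] + 1) * 0.5 * (W - 1)).round().long().clamp(0, W - 1)
        y = ((uv[..., 1] + 1) * 0.5 * (H - 1)).round().long().clamp(0, H - 1)
        canvas = img_feat.new_zeros(B, c_pt, H, W)
        idx = y * W + x                                # (B, N) flat pixel indices
        canvas.view(B, c_pt, H * W).scatter_(
            2, idx.unsqueeze(1).expand(-1, c_pt, -1),
            pt_feat.transpose(1, 2),
        )
        img_out = self.to_img(torch.cat([img_feat, canvas], dim=1))
        return img_out, pt_out
```

Under these assumptions, a call would look like `img_out, pt_out = fuse(img_feat, pt_feat, uv)`, where `uv` holds each LiDAR point's projection into the image plane; both branches then continue with their fused features.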
URL
https://arxiv.org/abs/2303.12017