Manual pruning of radiata pine trees presents significant safety risks due to their substantial height and the challenging terrains in which they thrive. To address these risks, this research proposes the development of a drone-based pruning system equipped with specialized pruning tools and a stereo vision camera, enabling precise detection and trimming of branches. Deep learning algorithms, including YOLO and Mask R-CNN, are employed to ensure accurate branch detection, while the Semi-Global Matching algorithm is integrated to provide reliable distance estimation. The synergy between these techniques facilitates the precise identification of branch locations and enables efficient, targeted pruning. Experimental results demonstrate that the combined implementation of YOLO and SGBM enables the drone to accurately detect branches and measure their distances from the drone. This research not only improves the safety and efficiency of pruning operations but also makes a significant contribution to the advancement of drone technology in the automation of agricultural and forestry practices, laying a foundational framework for further innovations in environmental management.
https://arxiv.org/abs/2409.17526
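The distance estimation step in the drone-pruning abstract above rests on stereo disparity. As a rough illustration only, the standard pinhole relation Z = f * B / d for a rectified stereo pair can be sketched as follows. The function name, the focal length, and the baseline are illustrative assumptions, not values from the paper.

```python
def depth_from_disparity(disparity_px, focal_px, baseline_m):
    """Distance (metres) along the optical axis for a rectified stereo pair.

    Uses the pinhole relation Z = f * B / d, with the focal length f in
    pixels, the baseline B in metres, and the disparity d in pixels.
    """
    if disparity_px <= 0:
        # Zero or negative disparity means no valid match (or a point at infinity).
        return None
    return focal_px * baseline_m / disparity_px


# Hypothetical camera (assumed values): 700 px focal length, 0.12 m baseline.
for d in (84.0, 42.0, 21.0):
    print(d, depth_from_disparity(d, focal_px=700.0, baseline_m=0.12))
```

Halving the disparity doubles the estimated distance, which is why small disparity errors matter most for far-away branches.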
Heavy-duty operations, typically performed using heavy-duty hydraulic manipulators (HHMs), are susceptible to environmental contact due to tracking errors or sudden environmental changes. Therefore, beyond precise control design, it is crucial that the manipulator be resilient to potential impacts without relying on contact-force sensors, which in most cases cannot be used. This paper proposes a novel force-sensorless, robust, impact-resilient controller for a generic 6-degree-of-freedom (DoF) HHM consisting of an anthropomorphic arm and a spherical wrist. The scheme consists of a neuroadaptive subsystem-based impedance controller, designed to ensure accurate tracking of position and orientation while stabilizing the HHM upon contact, along with a novel generalized momentum observer, introduced for the first time in Plücker coordinates, to estimate the impact force. Finally, by leveraging the concepts of virtual stability and virtual power flow, semi-global uniform ultimate boundedness of the entire system is assured. To demonstrate the efficacy and versatility of the proposed method, extensive experiments were conducted using a generic 6-DoF industrial HHM. The experimental results confirm the exceptional performance of the designed method, which achieves subcentimeter tracking accuracy and an 80% reduction in contact impact.
https://arxiv.org/abs/2408.09147
Crowd counting is gaining societal relevance, particularly in domains of Urban Planning, Crowd Management, and Public Safety. This paper introduces Fourier-guided attention (FGA), a novel attention mechanism for crowd count estimation designed to address the inefficient full-scale global pattern capture in existing works on convolution-based attention networks. FGA efficiently captures multi-scale information, including full-scale global patterns, by utilizing Fast-Fourier Transformations (FFT) along with spatial attention for global features and convolutions with channel-wise attention for semi-global and local features. The architecture of FGA involves a dual-path approach: (1) a path for processing full-scale global features through FFT, allowing for efficient extraction of information in the frequency domain, and (2) a path for processing remaining feature maps for semi-global and local features using traditional convolutions and channel-wise attention. This dual-path architecture enables FGA to seamlessly integrate frequency and spatial information, enhancing its ability to capture diverse crowd patterns. We apply FGA in the last layers of two popular crowd-counting works, CSRNet and CANNet, to evaluate the module's performance on benchmark datasets such as ShanghaiTech-A, ShanghaiTech-B, UCF-CC-50, and JHU++ crowd. The experiments demonstrate a notable improvement across all datasets based on Mean-Squared-Error (MSE) and Mean-Absolute-Error (MAE) metrics, showing comparable performance to recent state-of-the-art methods. Additionally, we illustrate the interpretability using qualitative analysis, leveraging Grad-CAM heatmaps, to show the effectiveness of FGA in capturing crowd patterns.
https://arxiv.org/abs/2407.06110
Palm oil production has been identified as one of the major drivers of deforestation in tropical countries. To meet supply chain objectives, commodity producers and other stakeholders need timely information on land cover dynamics in their supply shed. However, such data are difficult to obtain from suppliers who may lack digital geographic representations of their supply sheds and production locations. Here we present a "community model," a machine learning model trained on pooled data sourced from many different stakeholders, to develop a specific land cover probability map, in this case a semi-global oil palm map. Advantages of this method are the inclusion of varied inputs, the ability to easily update the model as new training data becomes available, and the ability to run the model on any year for which input imagery is available. Inclusion of diverse data sources into one probability map can help establish a shared understanding across stakeholders on the presence and absence of a land cover or commodity (in this case oil palm). The model predictors are annual composites built from publicly available satellite imagery provided by Sentinel-1, Sentinel-2, and ALOS DSM. We provide map outputs as the probability of palm in a given pixel, to reflect the uncertainty of the underlying state (palm or not palm). The initial version of this model provides global accuracy estimated to be approximately 90% (at 0.5 probability threshold) from spatially partitioned test data. This model and the resulting oil palm probability map products are useful for accurately identifying the geographic footprint of palm cultivation. Used in conjunction with timely deforestation information, this palm model is useful for understanding the risk of continued oil palm plantation expansion in sensitive forest areas.
https://arxiv.org/abs/2405.09530
In the reconstruction of façade elements, the identification of specific object types remains challenging and is often circumvented by rectangularity assumptions or the use of bounding boxes. We propose a new approach for the reconstruction of 3D façade details. We combine MLS point clouds and a pre-defined 3D model library using a BoW concept, which we augment by incorporating semi-global features. We conduct experiments on the models superimposed with random noise and on the TUM-FAÇADE dataset. Our method demonstrates promising results, improving the conventional BoW approach. It holds the potential to be utilized for more realistic facade reconstruction without rectangularity assumptions, which can be used in applications such as testing automated driving functions or estimating façade solar potential.
https://arxiv.org/abs/2402.06521
Vast industrial investment, along with increased academic research on hydraulic heavy-duty manipulators, has unavoidably paved the way for their automation, necessitating the design of robust and high-precision controllers. In this study, an orchestrated robust controller is designed to address this need. To do so, the entire robotic system is decomposed into subsystems and, using virtual decomposition control (VDC), a robust controller is designed for each local subsystem by considering unknown model uncertainties, unknown disturbances, and compound input constraints. Radial basis function neural networks (RBFNNs) are incorporated into VDC to tackle unknown disturbances and uncertainties, resulting in novel decentralized RBFNNs. The robust local controllers designed for each subsystem are then orchestrated to accomplish high-precision control. Finally, for the first time in the context of VDC, semi-global uniform ultimate boundedness is achieved under the designed controller. The validity of the theoretical results is verified by extensive simulations and experiments on a 6-degree-of-freedom industrial manipulator with a nominal lifting capacity of $600\, kg$ at $5$ meters reach. A comparison of the simulation results with a state-of-the-art controller, together with the experimental results, demonstrates that the proposed method delivers on its promises and performs excellently.
https://arxiv.org/abs/2312.06304
In Ultrasound Localization Microscopy (ULM), achieving high-resolution images relies on the precise localization of contrast agent particles across consecutive beamformed frames. However, our study uncovers an enormous potential: The process of delay-and-sum beamforming leads to an irreversible reduction of Radio-Frequency (RF) data, while its implications for localization remain largely unexplored. The rich contextual information embedded within RF wavefronts, including their hyperbolic shape and phase, offers great promise for guiding Deep Neural Networks (DNNs) in challenging localization scenarios. To fully exploit this data, we propose to directly localize scatterers in RF signals. Our approach involves a custom super-resolution DNN using learned feature channel shuffling and a novel semi-global convolutional sampling block tailored for reliable and accurate localization in RF input data. Additionally, we introduce a geometric point transformation that facilitates seamless mapping between B-mode and RF spaces. To validate the effectiveness of our method and understand the impact of beamforming, we conduct an extensive comparison with State-Of-The-Art (SOTA) techniques in ULM. We present the inaugural in vivo results from an RF-trained DNN, highlighting its real-world practicality. Our findings show that RF-ULM bridges the domain gap between synthetic and real datasets, offering a considerable advantage in terms of precision and complexity. To enable the broader research community to benefit from our findings, our code and the associated SOTA methods are made available at this https URL.
https://arxiv.org/abs/2310.01545
Digital surface model generation using traditional multi-view stereo matching (MVS) performs poorly over non-Lambertian surfaces, with asynchronous acquisitions, or at discontinuities. Neural radiance fields (NeRF) offer a new paradigm for reconstructing surface geometries using a continuous volumetric representation. NeRF is self-supervised, does not require ground truth geometry for training, and provides an elegant way to include physical parameters of the scene in its representation, thus potentially remedying the challenging scenarios where MVS fails. However, NeRF and its variants require many views to produce convincing scene geometries, and such views are rarely available in Earth observation satellite imaging. In this paper we present SparseSat-NeRF (SpS-NeRF), an extension of Sat-NeRF adapted to sparse satellite views. SpS-NeRF employs dense depth supervision guided by a cross-correlation similarity metric provided by traditional semi-global MVS matching. We demonstrate the effectiveness of our approach on stereo and tri-stereo Pleiades 1B/WorldView-3 images, and compare against NeRF and Sat-NeRF. The code is available at this https URL.
https://arxiv.org/abs/2309.00277
This paper proposes a nonlinear stochastic complementary filter design for inertial navigation that fuses Ultra-wideband (UWB) and Inertial Measurement Unit (IMU) technology, ensuring semi-global uniform ultimate boundedness (SGUUB) of the closed-loop error signals in mean square. The proposed filter estimates the vehicle's orientation, position, linear velocity, and noise covariance. The filter is designed to mimic the nonlinear navigation motion kinematics and is posed on a matrix Lie Group, the extended form of the Special Euclidean Group $\mathbb{SE}_{2}\left(3\right)$. The Lie Group based structure of the proposed filter provides a unique and global representation, avoiding singularity (a common shortcoming of Euler angles) as well as non-uniqueness (a common limitation of unit-quaternion). Unlike Kalman-type filters, the proposed filter successfully addresses IMU measurement noise with unknown upper-bounded covariance. Although the navigation estimator is proposed in a continuous form, the discrete version is also presented. Moreover, the unit-quaternion implementation is provided in the Appendix. Experimental validation is performed using a publicly available real-world six-degrees-of-freedom (6 DoF) flight dataset obtained from an unmanned Micro Aerial Vehicle (MAV), illustrating the robustness of the proposed navigation technique. Keywords: Sensor-fusion, Inertial navigation, Ultra-wideband ranging, Inertial measurement unit, Stochastic differential equation, Stability, Localization, Observer design.
https://arxiv.org/abs/2308.13393
Time of Flight (ToF) is a prevalent depth sensing technology in the fields of robotics, medical imaging, and non-destructive testing. Yet, ToF sensing faces challenges from complex ambient conditions making an inverse modelling from the sparse temporal information intractable. This paper highlights the potential of modern super-resolution techniques to learn varying surroundings for a reliable and accurate ToF detection. Unlike existing models, we tailor an architecture for sub-sample precise semi-global signal localization by combining super-resolution with an efficient residual contraction block to balance between fine signal details and large scale contextual information. We consolidate research on ToF by conducting a benchmark comparison against six state-of-the-art methods for which we employ two publicly available datasets. This includes the release of our SToF-Chirp dataset captured by an airborne ultrasound transducer. Results showcase the superior performance of our proposed StofNet in terms of precision, reliability and model complexity. Our code is available at this https URL.
https://arxiv.org/abs/2308.12009
Recent works have shown that depth information can be obtained from Dual-Pixel (DP) sensors. A DP arrangement provides two views in a single shot, thus resembling a stereo image pair with a tiny baseline. However, the different point spread function (PSF) per view, as well as the small disparity range, makes the use of typical stereo matching algorithms problematic. To address the above shortcomings, we propose a Continuous Cost Aggregation (CCA) scheme within a semi-global matching framework that is able to provide accurate continuous disparities from DP images. The proposed algorithm fits parabolas to matching costs and aggregates parabola coefficients along image paths. The aggregation step is performed subject to a quadratic constraint that not only enforces the disparity smoothness but also maintains the quadratic form of the total costs. This gives rise to an inherently efficient disparity propagation scheme with a pixel-wise minimization in closed-form. Furthermore, the continuous form allows for a robust multi-scale aggregation that better compensates for the varying PSF. Experiments on DP data from both DSLR and phone cameras show that the proposed scheme attains state-of-the-art performance in DP disparity estimation.
https://arxiv.org/abs/2306.07921
In this paper, we propose a real-time FPGA implementation of the Semi-Global Matching (SGM) stereo vision algorithm. The designed module supports a 4K/Ultra HD (3840 x 2160 pixels @ 30 frames per second) video stream in a 4-pixels-per-clock (ppc) format and a 64-pixel disparity range. The baseline SGM implementation had to be modified to process pixels in the 4ppc format and meet the timing constraints; however, our version provides results comparable to the original design. The solution has been positively evaluated on the Xilinx VC707 development board with a Virtex-7 FPGA device.
https://arxiv.org/abs/2301.04847
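For readers unfamiliar with SGM, the core of the algorithm named in the FPGA abstract above is a per-path cost aggregation. The following is a minimal software sketch of the standard SGM recurrence along one left-to-right scanline. It is not the 4ppc hardware design from the paper, and the penalties P1 and P2 and the toy cost values are illustrative assumptions.

```python
def aggregate_path(cost, p1, p2):
    """Aggregate matching costs left-to-right along one scanline.

    cost[x][d] is the matching cost of pixel x at disparity d. The result L
    follows the SGM recurrence
        L(x, d) = C(x, d)
                  + min(L(x-1, d), L(x-1, d-1) + P1, L(x-1, d+1) + P1,
                        min_k L(x-1, k) + P2)
                  - min_k L(x-1, k)
    where P1 penalises one-step disparity changes and P2 larger jumps.
    """
    n_disp = len(cost[0])
    aggregated = [list(cost[0])]  # the first pixel has no predecessor
    for x in range(1, len(cost)):
        prev = aggregated[-1]
        prev_min = min(prev)
        row = []
        for d in range(n_disp):
            best = min(prev[d], prev_min + p2)
            if d > 0:
                best = min(best, prev[d - 1] + p1)
            if d < n_disp - 1:
                best = min(best, prev[d + 1] + p1)
            # Subtracting prev_min keeps the values from growing along the path.
            row.append(cost[x][d] + best - prev_min)
        aggregated.append(row)
    return aggregated


# Toy cost volume: 3 pixels, 3 disparities (assumed values).
toy = [[0, 5, 5], [4, 3, 5], [5, 5, 0]]
for row in aggregate_path(toy, p1=1, p2=4):
    print(row)
```

A full SGM implementation sums such aggregated costs over several path directions before picking the minimum-cost disparity per pixel.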
Image-based visual servoing without a system model is challenging, since it is hard to obtain an accurate estimate of the hand-eye relationship from visual measurements alone. At the same time, the accuracy of the estimated hand-eye relationship, expressed in local linear form with a Jacobian matrix, is important to the performance of the whole system. In this article, we propose a finite-time controller together with a Jacobian matrix estimator that combines online and offline learning. The local linear formulation is established first. Then, we use a combination of online and offline methods to improve the estimation of the highly coupled and nonlinear hand-eye relationship with data collected via a depth camera. A neural network (NN) is pre-trained to give a reasonable initial estimate of the Jacobian matrix. An online updating method is then applied to refine the offline-trained NN for a more accurate estimate. Moreover, a sliding mode control algorithm is introduced to realize a finite-time controller. Compared with previous methods, our algorithm converges faster. Compared with other data-driven estimators, the proposed estimator shows excellent accuracy in its initial estimate and strong tracking capability for time-varying estimation of the Jacobian matrix. The proposed scheme combines the neural network with finite-time control, which yields faster convergence than exponentially convergent schemes. Another main feature of our algorithm is that the state signals of the system are proved to be semi-globally practically finite-time stable. Several experiments are carried out to validate the performance of the proposed algorithm.
https://arxiv.org/abs/2211.11178
Deep learning (DL) stereo matching methods have gained great attention for remote sensing satellite datasets. However, most existing studies draw conclusions from only one or a few stereo images and lack a systematic evaluation of how robust DL methods are on satellite stereo images with varying radiometric and geometric configurations. This paper evaluates four DL stereo matching methods on hundreds of multi-date, multi-site satellite stereo pairs with varying geometric configurations, against the traditional, well-practiced Census-SGM (semi-global matching), to comprehensively understand their accuracy, robustness, generalization capabilities, and practical potential. The DL methods include a learning-based cost metric through convolutional neural networks (MC-CNN) followed by SGM, and three end-to-end (E2E) learning models: Geometry and Context Network (GCNet), Pyramid Stereo Matching Network (PSMNet), and LEAStereo. Our experiments show that E2E algorithms can achieve upper limits of geometric accuracy but may not generalize well to unseen data. The learning-based cost metric and Census-SGM are rather robust and consistently achieve acceptable results. All DL algorithms are robust to the geometric configurations of stereo pairs and are less sensitive to them than Census-SGM, while learning-based cost metrics can generalize to satellite images when trained on different datasets (airborne or ground-view).
https://arxiv.org/abs/2210.14031
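The Census-SGM baseline in the abstract above pairs SGM with a Census matching cost. As a hedged illustration, a 3x3 Census descriptor and its Hamming-distance cost can be sketched as below. The window size, the bit ordering, and the toy image patches are assumptions made for the example and are not taken from the paper.

```python
def census_3x3(img, y, x):
    """3x3 Census descriptor of pixel (y, x) as an 8-bit integer.

    Each neighbour contributes one bit, set when the neighbour is darker
    than the centre pixel (row-major order, centre skipped).
    """
    centre = img[y][x]
    bits = 0
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            if dy == 0 and dx == 0:
                continue
            bits = (bits << 1) | (1 if img[y + dy][x + dx] < centre else 0)
    return bits


def hamming(a, b):
    """Number of differing bits between two Census descriptors."""
    return bin(a ^ b).count("1")


# Toy patches (assumed values): 'brighter' is 'left' plus a constant offset.
left = [[10, 20, 30], [40, 50, 60], [70, 80, 90]]
brighter = [[v + 100 for v in row] for row in left]
changed = [[10, 20, 30], [40, 50, 60], [70, 80, 5]]

print(hamming(census_3x3(left, 1, 1), census_3x3(brighter, 1, 1)))
print(hamming(census_3x3(left, 1, 1), census_3x3(changed, 1, 1)))
```

Because the descriptor only records intensity ordering, a constant brightness offset leaves the cost unchanged, which is one reason Census costs are popular for multi-date satellite pairs with radiometric differences.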
We present a self-supervised learning (SSL) method suitable for semi-global tasks such as object detection and semantic segmentation. We enforce local consistency between self-learned features, representing corresponding image locations of transformed versions of the same image, by minimizing a pixel-level local contrastive (LC) loss during training. LC-loss can be added to existing self-supervised learning methods with minimal overhead. We evaluate our SSL approach on two downstream tasks -- object detection and semantic segmentation, using COCO, PASCAL VOC, and CityScapes datasets. Our method outperforms the existing state-of-the-art SSL approaches by 1.9% on COCO object detection, 1.4% on PASCAL VOC detection, and 0.6% on CityScapes segmentation.
https://arxiv.org/abs/2207.04398
Deep learning (DL) methods are widely investigated for stereo image matching tasks due to their reported high accuracies. However, their transferability and generalization capabilities are limited by the instances seen in the training data. With satellite images covering large-scale areas with variations in location, content, land cover, and spatial patterns, we expect their performance to be impacted. Increasing the number and diversity of training data is always an option, but because ground-truth disparity is limited in remote sensing due to its high cost, it is almost impossible to obtain the ground truth for all locations. Knowing that classical stereo matching methods such as Census-based semi-global matching (SGM) are widely adopted to process different types of stereo data, we therefore propose a fine-tuning method that takes advantage of disparity maps derived from SGM on target stereo data. Our proposed method adopts a simple scheme that uses the energy map derived from the SGM algorithm to select high-confidence disparity measurements, while at the same time using the images to restrict the selected disparity measurements to texture-rich regions. Our approach aims to investigate the possibility of improving the transferability of current DL methods to unseen target data without requiring their ground truth. To perform a comprehensive study, we select 20 study sites around the world to cover a variety of complexities and densities. We choose well-established DL methods such as geometric and context network (GCNet), pyramid stereo matching network (PSMNet), and LEAStereo for evaluation. Our results indicate an improvement in the transferability of the DL methods across different regions, both visually and numerically.
https://arxiv.org/abs/2205.14051
The complementary fusion of light detection and ranging (LiDAR) data and image data is a promising but challenging task for generating high-precision and high-density point clouds. This study proposes an innovative approach called LiDAR-guided stereo matching (LGSM), which considers the spatial consistency represented by continuous disparity or depth changes in the homogeneous region of an image. The LGSM first detects the homogeneous pixels of each LiDAR projection point based on their color or intensity similarity. Next, we propose a riverbed enhancement function to optimize the cost volume of the LiDAR projection points and their homogeneous pixels to improve the matching robustness. Our formulation expands the constraint scopes of sparse LiDAR projection points with the guidance of image information to optimize the cost volume of pixels as much as possible. We applied LGSM to semi-global matching and AD-Census on both simulated and real datasets. When the percentage of LiDAR points in the simulated datasets was 0.16%, the matching accuracy of our method reached a subpixel level, while that of the original stereo matching algorithm was 3.4 pixels. The experimental results show that LGSM is suitable for indoor, street, aerial, and satellite image datasets and provides good transferability across semi-global matching and AD-Census. Furthermore, the qualitative and quantitative evaluations demonstrate that LGSM is superior to two state-of-the-art cost volume optimization methods, especially in reducing mismatches in difficult matching areas and refining the boundaries of objects.
https://arxiv.org/abs/2202.09953
Visual exploration is a task that seeks to visit all the navigable areas of an environment as quickly as possible. The existing methods employ deep reinforcement learning (RL) as the standard tool for the task. However, they tend to be vulnerable to statistical shifts between the training and test data, resulting in poor generalization over novel environments that are out-of-distribution (OOD) from the training data. In this paper, we attempt to improve the generalization ability by utilizing the inductive biases available for the task. Employing the active neural SLAM (ANS) that learns exploration policies with the advantage actor-critic (A2C) method as the base framework, we first point out that the mappings represented by the actor and the critic should satisfy specific symmetries. We then propose a network design for the actor and the critic to inherently attain these symmetries. Specifically, we use $G$-convolution instead of the standard convolution and insert the semi-global polar pooling (SGPP) layer, which we newly design in this study, in the last section of the critic network. Experimental results show that our method increases area coverage by $8.1 m^2$ when trained on the Gibson dataset and tested on the MP3D dataset, establishing the new state-of-the-art.
https://arxiv.org/abs/2112.09515
With FaSS-MVS, we present an approach for fast multi-view stereo with surface-aware Semi-Global Matching that allows for rapid depth and normal map estimation from monocular aerial video data captured by UAVs. The data estimated by FaSS-MVS, in turn, facilitates online 3D mapping, meaning that a 3D map of the scene is immediately and incrementally generated while the image data is acquired or being received. FaSS-MVS comprises a hierarchical processing scheme in which depth and normal data, as well as corresponding confidence scores, are estimated in a coarse-to-fine manner, allowing efficient processing of the large scene depths inherent to oblique imagery captured by low-flying UAVs. The actual depth estimation employs a plane-sweep algorithm for dense multi-image matching to produce depth hypotheses, from which the actual depth map is extracted by means of a surface-aware semi-global optimization, reducing the fronto-parallel bias of SGM. Given the estimated depth map, the pixel-wise surface normal information is then computed by reprojecting the depth map into a point cloud and calculating the normal vectors within a confined local neighborhood. In a thorough quantitative and ablative study, we show that the accuracy of the 3D information calculated by FaSS-MVS is close to that of state-of-the-art approaches for offline multi-view stereo, with the error not even one order of magnitude higher than that of COLMAP. At the same time, however, the average run-time of FaSS-MVS to estimate a single depth and normal map is less than 14% of that of COLMAP, allowing online and incremental processing of Full-HD imagery at 1-2 Hz.
https://arxiv.org/abs/2112.00821
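The normal-map step in the FaSS-MVS abstract above (reprojecting the depth map into a point cloud and computing normals in a local neighborhood) can be illustrated with a very small sketch. The version below assumes a pinhole camera and takes the cross product of central differences between the four direct neighbours; this neighbourhood choice, the function names, and the intrinsics are assumptions for illustration and may differ from what FaSS-MVS actually does.

```python
import math


def backproject(depth, fx, fy, cx, cy):
    """Turn a depth map into a grid of 3D points using pinhole intrinsics."""
    return [
        [((u - cx) * z / fx, (v - cy) * z / fy, z) for u, z in enumerate(row)]
        for v, row in enumerate(depth)
    ]


def normal_at(points, v, u):
    """Unit surface normal at interior pixel (v, u), oriented toward the camera."""
    a = tuple(q - p for p, q in zip(points[v][u - 1], points[v][u + 1]))  # along image x
    b = tuple(q - p for p, q in zip(points[v - 1][u], points[v + 1][u]))  # along image y
    n = (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )
    length = math.sqrt(sum(c * c for c in n))
    if length == 0.0:
        return None  # degenerate neighbourhood
    n = tuple(c / length for c in n)
    # The camera sits at the origin, so a camera-facing normal has n . p < 0.
    if sum(c * p for c, p in zip(n, points[v][u])) > 0:
        n = tuple(-c for c in n)
    return n


# Fronto-parallel plane at 2 m depth with hypothetical intrinsics.
flat = [[2.0] * 3 for _ in range(3)]
pts = backproject(flat, fx=100.0, fy=100.0, cx=1.0, cy=1.0)
print(normal_at(pts, 1, 1))
```

For the fronto-parallel test plane the normal points straight back along the optical axis toward the camera, as expected.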
A robust nonlinear stochastic observer for simultaneous localization and mapping (SLAM) is proposed using the available uncertain measurements of angular velocity, translational velocity, and features. The proposed observer is posed on the Lie Group of $\mathbb{SLAM}_{n}\left(3\right)$ to mimic the true stochastic SLAM dynamics. The proposed approach considers the velocity measurements to be corrupted by an unknown bias and unknown Gaussian noise. The proposed SLAM observer ensures that the closed-loop error signals are semi-globally uniformly ultimately bounded. Simulation results demonstrate the efficiency and robustness of the proposed approach, revealing its ability to localize the unknown vehicle, as well as to map the unknown environment, given measurements obtained from low-cost units.
https://arxiv.org/abs/2109.06323