Accurate position estimation is essential for modern navigation systems deployed in autonomous platforms, including ground vehicles, marine vessels, and aerial drones. In this context, Visual Simultaneous Localisation and Mapping (VSLAM), which includes Visual Odometry, relies heavily on the reliable extraction of salient feature points from the visual input data. In this work, we propose an embedded implementation of an unsupervised architecture capable of detecting and describing feature points. It is based on a quantised SuperPoint convolutional neural network. Our objective is to minimise the computational demands of the model while preserving high detection quality, thus facilitating efficient deployment on platforms with limited resources, such as mobile or embedded systems. We implemented the solution on an FPGA System-on-Chip (SoC) platform, specifically the AMD/Xilinx Zynq UltraScale+, where we evaluated the performance of Deep Learning Processing Units (DPUs) and used the Brevitas library and the FINN framework to perform model quantisation and hardware-aware optimisation. This allowed us to process 640 x 480 pixel images at up to 54 fps on an FPGA platform, outperforming state-of-the-art solutions in the field. We conducted experiments on the TUM dataset to demonstrate and discuss the impact of different quantisation techniques on the accuracy and performance of the model in a visual odometry task.
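As a rough illustration of the quantisation step mentioned above (not the authors' code), a SuperPoint-style encoder block can be written with Brevitas quantised layers; the 4-bit widths and channel sizes below are assumptions.

```python
import torch
from brevitas.nn import QuantConv2d, QuantReLU

class QuantEncoderBlock(torch.nn.Module):
    """One VGG-style block of a SuperPoint-like encoder with quantised weights and activations."""
    def __init__(self, in_ch=1, out_ch=64, weight_bits=4, act_bits=4):
        super().__init__()
        self.conv1 = QuantConv2d(in_ch, out_ch, kernel_size=3, padding=1,
                                 weight_bit_width=weight_bits)
        self.relu1 = QuantReLU(bit_width=act_bits)
        self.conv2 = QuantConv2d(out_ch, out_ch, kernel_size=3, padding=1,
                                 weight_bit_width=weight_bits)
        self.relu2 = QuantReLU(bit_width=act_bits)
        self.pool = torch.nn.MaxPool2d(2, 2)

    def forward(self, x):
        x = self.relu1(self.conv1(x))
        x = self.relu2(self.conv2(x))
        return self.pool(x)

block = QuantEncoderBlock()
out = block(torch.randn(1, 1, 480, 640))  # one grayscale VGA frame
print(out.shape)                          # torch.Size([1, 64, 240, 320])
```

A network built from such blocks can then be taken through the FINN toolchain for dataflow-style FPGA synthesis, which is the general route the abstract describes.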
https://arxiv.org/abs/2507.07903
Thermal imaging from unmanned aerial vehicles (UAVs) holds significant potential for applications in search and rescue, wildlife monitoring, and emergency response, especially under low-light or obscured conditions. However, the scarcity of large-scale, diverse thermal aerial datasets limits the advancement of deep learning models in this domain, primarily due to the high cost and logistical challenges of collecting thermal data. In this work, we introduce a novel procedural pipeline for generating synthetic thermal images from an aerial perspective. Our method integrates arbitrary object classes into existing thermal backgrounds by providing control over the position, scale, and orientation of the new objects, while aligning them with the viewpoints of the background. We enhance existing thermal datasets by introducing new object categories, specifically adding a drone class in urban environments to the HIT-UAV dataset and an animal category to the MONET dataset. In evaluating these datasets on the object detection task, we showcase strong performance across both new and existing classes, validating the successful expansion into new applications. Through comparative analysis, we show that thermal detectors outperform their visible-light-trained counterparts and highlight the importance of replicating aerial viewing angles. Project page: this https URL.
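To make the compositing idea concrete, the toy sketch below pastes a scaled, rotated object patch into a thermal background at a chosen position. The function and variable names are illustrative only; the actual pipeline additionally matches viewpoints and thermal statistics of the background.

```python
import numpy as np
from scipy.ndimage import rotate, zoom

def composite_thermal(background, patch, mask, center, scale=1.0, angle_deg=0.0):
    """Paste a thermal object patch (with a binary mask) into a background frame.

    background, patch : 2D float arrays of thermal intensities.
    mask              : binary array, same shape as patch, 1 where the object is.
    center            : (row, col) paste location in the background.
    """
    patch = zoom(rotate(patch, angle_deg, reshape=True), scale, order=1)
    mask = zoom(rotate(mask.astype(float), angle_deg, reshape=True), scale, order=0) > 0.5

    out = background.copy()
    h, w = patch.shape
    r0, c0 = center[0] - h // 2, center[1] - w // 2
    r1, c1 = r0 + h, c0 + w
    # Clip to the background bounds (partially out-of-frame objects are cropped).
    pr0, pc0 = max(0, -r0), max(0, -c0)
    r0, c0 = max(0, r0), max(0, c0)
    r1, c1 = min(out.shape[0], r1), min(out.shape[1], c1)
    region_mask = mask[pr0:pr0 + (r1 - r0), pc0:pc0 + (c1 - c0)]
    out[r0:r1, c0:c1][region_mask] = patch[pr0:pr0 + (r1 - r0), pc0:pc0 + (c1 - c0)][region_mask]
    return out

bg = np.random.rand(512, 640).astype(np.float32)   # stand-in thermal background
obj = np.full((40, 60), 0.9, np.float32)           # stand-in warm object
composited = composite_thermal(bg, obj, np.ones_like(obj), center=(256, 320),
                               scale=0.5, angle_deg=30)
```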
https://arxiv.org/abs/2507.06797
This article presents a novel stream function-based navigational control system for obstacle avoidance, where obstacles are represented as two-dimensional (2D) rigid surfaces in inviscid, incompressible flows. The approach leverages the vortex panel method (VPM) and incorporates safety margins to control the stream function and flow properties around virtual surfaces, enabling navigation in complex, partially observed environments using real-time sensing. To address the limitations of the VPM in managing relative distance and avoiding rapidly accelerating obstacles at close proximity, the system integrates a model predictive controller (MPC) based on higher-order control barrier functions (HOCBF). This integration incorporates VPM trajectory generation, state estimation, and constraint handling into a receding-horizon optimization problem. The 2D rigid surfaces are enclosed using minimum bounding ellipses (MBEs), while an adaptive Kalman filter (AKF) captures and predicts obstacle dynamics, propagating these estimates into the MPC-HOCBF for rapid avoidance maneuvers. Evaluation is conducted using a PX4-powered Clover drone Gazebo simulator and real-time experiments involving a COEX Clover quadcopter equipped with a 360 degree LiDAR sensor.
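The paper constructs its guidance field with vortex panels over arbitrary 2D surfaces. As background only, the classical potential-flow stream function for uniform inviscid flow past a circular obstacle illustrates the idea of steering along a streamline; this textbook case is not the authors' VPM formulation.

```python
import numpy as np

def stream_function_cylinder(x, y, U_inf=1.0, R=1.0):
    """psi for uniform inviscid flow past a circular obstacle of radius R.

    The obstacle surface r = R coincides with the psi = 0 streamline, which is
    the property a stream-function-based planner steers along.
    """
    r = np.hypot(x, y)
    theta = np.arctan2(y, x)
    return U_inf * (r - R**2 / r) * np.sin(theta)

def velocity(x, y, U_inf=1.0, R=1.0, eps=1e-6):
    """Finite-difference velocity field: u = d(psi)/dy, v = -d(psi)/dx."""
    u = (stream_function_cylinder(x, y + eps, U_inf, R) -
         stream_function_cylinder(x, y - eps, U_inf, R)) / (2 * eps)
    v = -(stream_function_cylinder(x + eps, y, U_inf, R) -
          stream_function_cylinder(x - eps, y, U_inf, R)) / (2 * eps)
    return u, v

# The flow is tangent to the obstacle: psi is (numerically) zero on r = R.
angles = np.linspace(0.1, np.pi - 0.1, 5)
print(stream_function_cylinder(np.cos(angles), np.sin(angles)))  # ~0 everywhere
```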
https://arxiv.org/abs/2507.06787
The accurate localization and tracking of dynamic targets, such as equipment, people, vehicles, drones, robots, and the assets that they interact with in GPS-denied indoor environments is critical to enabling safe and efficient operations in the next generation of spatially aware industrial facilities. This paper presents DragonFly, a 3D localization system for highly dynamic backscatter tags using a single MIMO mmWave radar. The system delivers the first demonstration of a mmWave backscatter system capable of exploiting the capabilities of MIMO radars for the 3D localization of mmID tags moving at high speeds and accelerations at long ranges by introducing a critical Doppler disambiguation algorithm and a fully integrated cross-polarized dielectric lens-based mmID tag consuming a mere 68 µW. DragonFly was extensively evaluated in static and dynamic configurations, including on a flying quadcopter, and benchmarked against multiple baselines, demonstrating its ability to track the positions of multiple tags with a median 3D accuracy of 12 cm at speeds and accelerations on the order of 10 m/s and 4 m/s², and at ranges of up to 50 m.
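The abstract does not spell out the Doppler disambiguation algorithm; the sketch below shows only the generic idea of unwrapping an aliased radial velocity using a coarse velocity cue, with all variable names assumed.

```python
import numpy as np

def disambiguate_doppler(v_aliased, v_coarse, v_max):
    """Resolve Doppler aliasing for one radial velocity reading.

    v_aliased : velocity reported by the radar, wrapped into [-v_max, +v_max).
    v_coarse  : rough velocity estimate from another cue (e.g. frame-to-frame
                range differences), accurate to better than +-v_max.
    v_max     : maximum unambiguous velocity of the radar waveform.

    Returns the unwrapped value v_aliased + 2*k*v_max closest to the coarse estimate.
    """
    k = np.round((v_coarse - v_aliased) / (2.0 * v_max))
    return v_aliased + 2.0 * v_max * k

# A tag approaching at 9.3 m/s seen by a radar with v_max = 4 m/s aliases to 1.3 m/s;
# a coarse estimate of ~9 m/s recovers the true value.
print(disambiguate_doppler(v_aliased=1.3, v_coarse=9.0, v_max=4.0))  # 9.3
```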
https://arxiv.org/abs/2507.04602
Camera relocalization, a cornerstone capability of modern computer vision, accurately determines a camera's position and orientation (6-DoF) from images and is essential for applications in augmented reality (AR), mixed reality (MR), autonomous driving, delivery drones, and robotic navigation. Unlike traditional deep learning-based methods that regress camera pose from images in a single scene, which often lack generalization and robustness in diverse environments, we propose MVL-Loc, a novel end-to-end multi-scene 6-DoF camera relocalization framework. MVL-Loc leverages pretrained world knowledge from vision-language models (VLMs) and incorporates multimodal data to generalize across both indoor and outdoor settings. Furthermore, natural language is employed as a directive tool to guide the multi-scene learning process, facilitating semantic understanding of complex scenes and capturing spatial relationships among objects. Extensive experiments on the 7Scenes and Cambridge Landmarks datasets demonstrate MVL-Loc's robustness and state-of-the-art performance in real-world multi-scene camera relocalization, with improved accuracy in both positional and orientational estimates.
https://arxiv.org/abs/2507.04509
The widespread use of consumer drones has introduced serious challenges for airspace security and public safety. Their high agility and unpredictable motion make drones difficult to track and intercept. While existing methods focus on detecting current positions, many counter-drone strategies rely on forecasting future trajectories and thus require more than reactive detection to be effective. To address this critical gap, we propose an unsupervised vision-based method for predicting the three-dimensional trajectories of drones. Our approach first uses an unsupervised technique to extract drone trajectories from raw LiDAR point clouds, then aligns these trajectories with camera images through motion consistency to generate reliable pseudo-labels. We then combine kinematic estimation with a visual Mamba neural network in a self-supervised manner to predict future drone trajectories. We evaluate our method on the challenging MMAUD dataset, including the V2 sequences that feature wide-field-of-view multimodal sensors and dynamic UAV motion in urban scenes. Extensive experiments show that our framework outperforms supervised image-only and audio-visual baselines in long-horizon trajectory prediction, reducing 5-second 3D error by around 40 percent without using any manual 3D labels. The proposed system offers a cost-effective, scalable alternative for real-time counter-drone deployment. All code will be released upon acceptance to support reproducible research in the robotics community.
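The kinematic estimation component is not detailed in the abstract. A minimal constant-acceleration extrapolation over past 3D positions, which a learned model such as the visual Mamba network could then refine, might look like this sketch.

```python
import numpy as np

def extrapolate_constant_accel(track, dt, horizon_s):
    """Predict future 3D positions from a short observed track.

    track     : (N, 3) array of past positions sampled every dt seconds (N >= 3).
    horizon_s : prediction horizon in seconds (e.g. 5.0).
    Velocity and acceleration are estimated by finite differences on the last
    samples and rolled forward; a learned model would refine this baseline.
    """
    v = (track[-1] - track[-2]) / dt
    a = (track[-1] - 2 * track[-2] + track[-3]) / dt**2
    steps = int(round(horizon_s / dt))
    t = dt * np.arange(1, steps + 1)[:, None]
    return track[-1] + v * t + 0.5 * a * t**2

past = np.array([[0, 0, 10], [1, 0.1, 10.2], [2.1, 0.3, 10.5]], dtype=float)  # toy drone track
print(extrapolate_constant_accel(past, dt=0.5, horizon_s=5.0).shape)          # (10, 3)
```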
https://arxiv.org/abs/2507.03365
We present VISTA (Viewpoint-based Image selection with Semantic Task Awareness), an active exploration method for robots to plan informative trajectories that improve 3D map quality in areas most relevant for task completion. Given an open-vocabulary search instruction (e.g., "find a person"), VISTA enables a robot to explore its environment to search for the object of interest, while simultaneously building a real-time semantic 3D Gaussian Splatting reconstruction of the scene. The robot navigates its environment by planning receding-horizon trajectories that prioritize semantic similarity to the query and exploration of unseen regions of the environment. To evaluate trajectories, VISTA introduces a novel, efficient viewpoint-semantic coverage metric that quantifies both the geometric view diversity and task relevance in the 3D scene. On static datasets, our coverage metric outperforms state-of-the-art baselines, FisherRF and Bayes' Rays, in computation speed and reconstruction quality. In quadrotor hardware experiments, VISTA achieves 6x higher success rates in challenging maps, compared to baseline methods, while matching baseline performance in less challenging maps. Lastly, we show that VISTA is platform-agnostic by deploying it on a quadrotor drone and a Spot quadruped robot. Open-source code will be released upon acceptance of the paper.
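The paper defines its own viewpoint-semantic coverage metric; the toy score below only combines the two ingredients the abstract names, semantic similarity to the query and geometric view diversity, and every name and weighting choice here is an assumption.

```python
import numpy as np

def viewpoint_semantic_score(view_dirs_seen, cand_dir, cand_feat, query_feat):
    """Toy score for one candidate viewpoint: semantic relevance x view novelty.

    view_dirs_seen : (K, 3) unit viewing directions already captured for a region.
    cand_dir       : (3,) unit viewing direction of the candidate view.
    cand_feat      : (D,) semantic feature predicted for what the view would see.
    query_feat     : (D,) embedding of the open-vocabulary query ("find a person").
    """
    relevance = float(cand_feat @ query_feat /
                      (np.linalg.norm(cand_feat) * np.linalg.norm(query_feat)))
    if len(view_dirs_seen) == 0:
        novelty = 1.0
    else:
        # 1 - max cosine similarity to directions already covered.
        novelty = 1.0 - float(np.max(view_dirs_seen @ cand_dir))
    return relevance * novelty

seen = np.array([[1.0, 0.0, 0.0]])  # region already observed from the +x direction
print(viewpoint_semantic_score(seen, np.array([0.0, 1.0, 0.0]),
                               np.random.rand(16), np.random.rand(16)))
```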
https://arxiv.org/abs/2507.01125
Unmanned Aerial Vehicle (UAV) object detection has been widely used in traffic management, agriculture, emergency rescue, etc. However, it faces significant challenges, including occlusions, small object sizes, and irregular shapes. These challenges highlight the necessity for a robust and efficient multimodal UAV object detection method. Mamba has demonstrated considerable potential in multimodal image fusion. Leveraging this, we propose UAVD-Mamba, a multimodal UAV object detection framework based on Mamba architectures. To improve geometric adaptability, we propose the Deformable Token Mamba Block (DTMB) to generate deformable tokens by incorporating adaptive patches from deformable convolutions alongside normal patches from normal convolutions, which serve as the inputs to the Mamba Block. To optimize the multimodal feature complementarity, we design two separate DTMBs for the RGB and infrared (IR) modalities, with the outputs from both DTMBs integrated into the Mamba Block for feature extraction and into the Fusion Mamba Block for feature fusion. Additionally, to improve multiscale object detection, especially for small objects, we stack four DTMBs at different scales to produce multiscale feature representations, which are then sent to the Detection Neck for Mamba (DNM). The DNM module, inspired by the YOLO series, includes modifications to the SPPF and C3K2 of YOLOv11 to better handle the multiscale features. In particular, we employ cross-enhanced spatial attention before the DTMB and cross-channel attention after the Fusion Mamba Block to extract more discriminative features. Experimental results on the DroneVehicle dataset show that our method outperforms the baseline OAFA method by 3.6% in the mAP metric. Codes will be released at this https URL.
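A minimal sketch of the deformable-token idea using standard PyTorch and torchvision building blocks rather than the paper's exact DTMB: one branch applies a plain convolution, the other a deformable convolution with predicted offsets, and both outputs are flattened into token sequences.

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class DeformableTokens(nn.Module):
    """Produce 'normal' and 'deformable' patch tokens from a feature map (illustrative only)."""
    def __init__(self, channels, dim, k=3):
        super().__init__()
        self.normal = nn.Conv2d(channels, dim, k, padding=k // 2)
        self.offset = nn.Conv2d(channels, 2 * k * k, k, padding=k // 2)  # per-location (dx, dy) offsets
        self.deform = DeformConv2d(channels, dim, k, padding=k // 2)

    def forward(self, x):
        tok_n = self.normal(x)
        tok_d = self.deform(x, self.offset(x))
        # Flatten to (batch, tokens, dim) sequences for a Mamba-style block.
        return tok_n.flatten(2).transpose(1, 2), tok_d.flatten(2).transpose(1, 2)

feat = torch.randn(1, 32, 40, 40)
normal_tokens, deform_tokens = DeformableTokens(32, 64)(feat)
print(normal_tokens.shape, deform_tokens.shape)  # torch.Size([1, 1600, 64]) each
```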
https://arxiv.org/abs/2507.00849
Unmanned Aerial Vehicle-based Object Detection (UAV-OD) faces substantial challenges, including small target sizes, high-density distributions, and cluttered backgrounds in UAV imagery. Current algorithms often depend on hand-crafted components like anchor boxes, which demand fine-tuning and exhibit limited generalization, and Non-Maximum Suppression (NMS), which is threshold-sensitive and prone to misclassifying dense objects. These generic architectures thus struggle to adapt to aerial imaging characteristics, resulting in performance limitations. Moreover, emerging end-to-end frameworks have yet to effectively mitigate these aerial-specific challenges. To address these issues, we propose HEGS-DETR, a comprehensively enhanced, real-time Detection Transformer framework tailored for UAVs. First, we introduce the High-Frequency Enhanced Semantics Network (HFESNet) as a novel backbone. HFESNet preserves critical high-frequency spatial details to extract robust semantic features, thereby improving discriminative capability for small and occluded targets in complex backgrounds. Second, our Efficient Small Object Pyramid (ESOP) strategy strategically fuses high-resolution feature maps with minimal computational overhead, significantly boosting small object detection. Finally, the proposed Selective Query Recollection (SQR) and Geometry-Aware Positional Encoding (GAPE) modules enhance the detector's decoder stability and localization accuracy, effectively optimizing bounding boxes and providing explicit spatial priors for dense scenes. Experiments on the VisDrone dataset demonstrate that HEGS-DETR achieves a 5.1% AP$_{50}$ and 3.8% AP increase over the baseline, while maintaining real-time speed and reducing parameter count by 4M.
https://arxiv.org/abs/2507.00825
We propose a novel method for aerial visual localization over low Level-of-Detail (LoD) city models. The previous wireframe-alignment-based method LoD-Loc has shown promising localization results leveraging LoD models. However, LoD-Loc mainly relies on high-LoD (LoD3 or LoD2) city models, but the majority of available models, and those that many countries plan to construct nationwide, are low-LoD (LoD1). Consequently, enabling localization on low-LoD city models could unlock drones' potential for global urban localization. To address these issues, we introduce LoD-Loc v2, which employs a coarse-to-fine strategy using explicit silhouette alignment to achieve accurate localization over low-LoD city models in the air. Specifically, given a query image, LoD-Loc v2 first applies a building segmentation network to shape building silhouettes. Then, in the coarse pose selection stage, we construct a pose cost volume by uniformly sampling pose hypotheses around a prior pose to represent the pose probability distribution. Each cost of the volume measures the degree of alignment between the projected and predicted silhouettes. We select the pose with the maximum value as the coarse pose. In the fine pose estimation stage, a particle filtering method incorporating a multi-beam tracking approach is used to efficiently explore the hypothesis space and obtain the final pose estimation. To further facilitate research in this field, we release two datasets with LoD1 city models covering 10.7 km², along with real RGB queries and ground-truth pose annotations. Experimental results show that LoD-Loc v2 improves estimation accuracy with high-LoD models and enables localization with low-LoD models for the first time. Moreover, it outperforms state-of-the-art baselines by large margins, even surpassing texture-model-based methods, and broadens the convergence basin to accommodate larger prior errors.
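A toy version of the coarse pose-selection stage, assuming a hypothetical render_silhouette(pose) callable and searching only x, y, and yaw for brevity; the actual method samples the full pose space and uses its own alignment cost.

```python
import numpy as np

def select_coarse_pose(prior_pose, render_silhouette, pred_silhouette,
                       xyz_range=5.0, yaw_range_deg=10.0, n=5):
    """Grid-search a pose cost volume around a prior pose (coarse stage sketch).

    render_silhouette(pose) -> binary HxW silhouette of the LoD model for a pose.
    pred_silhouette          : binary HxW building mask from the query image.
    Each hypothesis is scored by silhouette IoU and the best one is returned.
    """
    x0, y0, z0, yaw0 = prior_pose
    best, best_score = prior_pose, -1.0
    for x in np.linspace(x0 - xyz_range, x0 + xyz_range, n):
        for y in np.linspace(y0 - xyz_range, y0 + xyz_range, n):
            for yaw in np.linspace(yaw0 - np.radians(yaw_range_deg),
                                   yaw0 + np.radians(yaw_range_deg), n):
                pose = (x, y, z0, yaw)
                proj = render_silhouette(pose)
                inter = np.logical_and(proj, pred_silhouette).sum()
                union = np.logical_or(proj, pred_silhouette).sum()
                iou = inter / union if union else 0.0
                if iou > best_score:
                    best, best_score = pose, iou
    return best, best_score

# Toy usage with a dummy renderer that ignores the pose:
dummy = lambda pose: np.ones((10, 10), bool)
print(select_coarse_pose((0, 0, 100, 0), dummy, np.ones((10, 10), bool))[1])  # 1.0
```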
https://arxiv.org/abs/2507.00659
Detecting individual tree crowns in tropical forests is essential to study these complex and crucial ecosystems impacted by human interventions and climate change. However, tropical crowns vary widely in size, structure, and pattern and are largely overlapping and intertwined, requiring advanced remote sensing methods applied to high-resolution imagery. Despite growing interest in tropical tree crown detection, annotated datasets remain scarce, hindering robust model development. We introduce SelvaBox, the largest open-access dataset for tropical tree crown detection in high-resolution drone imagery. It spans three countries and contains more than 83,000 manually labeled crowns - an order of magnitude larger than all previous tropical forest datasets combined. Extensive benchmarks on SelvaBox reveal two key findings: (1) higher-resolution inputs consistently boost detection accuracy; and (2) models trained exclusively on SelvaBox achieve competitive zero-shot detection performance on unseen tropical tree crown datasets, matching or exceeding competing methods. Furthermore, jointly training on SelvaBox and three other datasets at resolutions from 3 to 10 cm per pixel within a unified multi-resolution pipeline yields a detector ranking first or second across all evaluated datasets. Our dataset, code, and pre-trained weights are made public.
https://arxiv.org/abs/2507.00170
The rapid proliferation of unmanned aerial vehicles (UAVs) has highlighted the importance of robust and efficient object detection in diverse aerial scenarios. Detecting small objects under complex conditions, however, remains a significant challenge. Existing approaches often prioritize inference speed, leading to degraded performance when handling multi-modal inputs. To address this, we present DGE-YOLO, an enhanced YOLO-based detection framework designed to effectively fuse multi-modal information. Specifically, we introduce a dual-branch architecture for modality-specific feature extraction, enabling the model to process both infrared and visible images. To further enrich semantic representation, we propose an Efficient Multi-scale Attention (EMA) mechanism that enhances feature learning across spatial scales. Additionally, we replace the conventional neck with a Gather-and-Distribute module to mitigate information loss during feature aggregation. Extensive experiments on the Drone Vehicle dataset demonstrate that DGE-YOLO achieves superior performance over state-of-the-art methods, validating its effectiveness in multi-modal UAV object detection tasks.
https://arxiv.org/abs/2506.23252
Perception-related tasks often arise in autonomous systems operating under partial observability. This work studies the problem of synthesizing optimal policies for complex perception-related objectives in environments modeled by partially observable Markov decision processes (POMDPs). To formally specify such objectives, we introduce co-safe linear inequality temporal logic (sc-iLTL), which can define complex tasks that are formed by the logical concatenation of atomic propositions as linear inequalities on the belief space of the POMDPs. Our solution to the control synthesis problem is to transform the sc-iLTL objectives into reachability objectives by constructing the product of the belief MDP and a deterministic finite automaton built from the sc-iLTL objective. To overcome the scalability challenge due to the product, we introduce a Monte Carlo Tree Search (MCTS) method that converges in probability to the optimal policy. Finally, a drone-probing case study demonstrates the applicability of our method.
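The atomic propositions of sc-iLTL are linear inequalities on the POMDP belief. A minimal discrete belief update together with one such atom can be sketched as follows, with toy numbers that are not from the case study.

```python
import numpy as np

def belief_update(belief, action, obs, T, O):
    """One Bayes filter step on a discrete POMDP belief.

    belief : (S,) current belief over states.
    T      : (A, S, S) transition probabilities T[a, s, s'].
    O      : (A, S, O) observation probabilities O[a, s', o].
    """
    predicted = belief @ T[action]            # sum_s b(s) T(s'|s, a)
    updated = predicted * O[action][:, obs]   # weight by observation likelihood
    return updated / updated.sum()

# Two-state toy: sticky dynamics, noisy sensor.
T = np.array([[[0.9, 0.1], [0.1, 0.9]]])
O = np.array([[[0.8, 0.2], [0.2, 0.8]]])
b = belief_update(np.array([0.5, 0.5]), action=0, obs=0, T=T, O=O)
print(b, "atom satisfied:", b[0] >= 0.7)      # a linear-inequality atom on the belief
```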
https://arxiv.org/abs/2507.02942
Drones can inspect overhead power lines while they remain energized, significantly simplifying the inspection process. However, localizing a drone relative to all conductors using an onboard LiDAR sensor presents several challenges: (1) conductors provide minimal surface for LiDAR beams, limiting the number of conductor points in a scan, (2) not all conductors are consistently detected, and (3) distinguishing LiDAR points corresponding to conductors from other objects, such as trees and pylons, is difficult. This paper proposes an estimation approach that minimizes the error between LiDAR measurements and a single geometric model representing the entire conductor array, rather than tracking individual conductors separately. Experimental results, using data from a power line drone inspection, demonstrate that this method achieves accurate tracking, with a solver converging in under 50 ms per frame, even in the presence of partial observations, noise, and outliers. A sensitivity analysis shows that the estimation approach can tolerate up to twice as many outlier points as valid conductor measurements.
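The paper fits a single geometric model of the whole conductor array; as a simplified stand-in, the sketch below robustly fits one catenary to 2D points using a soft-L1 loss so that outlier returns do not dominate. All parameter values are assumed.

```python
import numpy as np
from scipy.optimize import least_squares

def fit_catenary(points_xz, c0=100.0):
    """Robustly fit z = z0 + c*(cosh((x - x0)/c) - 1) to 2D points in a vertical plane.

    points_xz : (N, 2) conductor points, x along the span, z vertical.
    The soft_l1 loss keeps stray returns (trees, pylons) from dominating the fit;
    the real system fits one model of the whole conductor array, not one wire.
    """
    x, z = points_xz[:, 0], points_xz[:, 1]

    def residuals(p):
        x0, z0, c = p
        return z0 + c * (np.cosh((x - x0) / c) - 1.0) - z

    p0 = np.array([x.mean(), z.min(), c0])
    sol = least_squares(residuals, p0, loss="soft_l1", f_scale=0.1,
                        bounds=([-np.inf, -np.inf, 1.0], [np.inf, np.inf, np.inf]))
    return sol.x

# Synthetic span (x0=0, z0=30, c=80) with two outlier points.
xs = np.linspace(-20, 20, 50)
zs = 30 + 80 * (np.cosh(xs / 80) - 1) + 0.01 * np.random.randn(50)
pts = np.vstack([np.column_stack([xs, zs]), [[0, 45], [5, 50]]])
print(fit_catenary(pts))   # approximately [0, 30, 80]
```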
https://arxiv.org/abs/2506.20812
External factors, including urban canyons and adversarial interference, can lead to Global Positioning System (GPS) inaccuracies that vary as a function of the position in the environment. This study addresses the challenge of estimating a static, spatially-varying error function using a team of robots. We introduce a State Bias Estimation Algorithm (SBE) whose purpose is to estimate the GPS biases. The central idea is to use sensed estimates of the range and bearing to the other robots in the team to estimate changes in bias across the environment. A set of drones moves in a 2D environment, each sampling data from GPS, range, and bearing sensors. The biases calculated by the SBE at estimated positions are used to train a Gaussian Process Regression (GPR) model. We use a Sparse Gaussian process-based Informative Path Planning (IPP) algorithm that identifies high-value regions of the environment for data collection. The swarm plans paths that maximize information gain in each iteration, further refining their understanding of the environment's positional bias landscape. We evaluated SBE and IPP in simulation and compared the IPP methodology to an open-loop strategy.
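A minimal illustration of the GPR step, with synthetic bias samples standing in for SBE output: fit a Gaussian process to position-indexed biases so that predicted bias and its uncertainty can be queried anywhere, the uncertainty being what an informative path planner would exploit.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

# Toy stand-in for SBE output: GPS bias samples at estimated 2D positions
# (positions in metres, biases in metres).
positions = np.random.uniform(0, 100, size=(40, 2))
true_bias = 2.0 * np.sin(positions[:, 0] / 20.0) + 0.5 * np.cos(positions[:, 1] / 15.0)
bias_samples = true_bias + 0.2 * np.random.randn(40)

# Fit a GP so the swarm can query the predicted bias (and its uncertainty) anywhere.
gpr = GaussianProcessRegressor(kernel=RBF(length_scale=20.0) + WhiteKernel(0.05),
                               normalize_y=True)
gpr.fit(positions, bias_samples)
mean, std = gpr.predict(np.array([[50.0, 50.0]]), return_std=True)
print(f"predicted bias {mean[0]:.2f} m +- {std[0]:.2f} m")
```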
https://arxiv.org/abs/2506.19712
While multi-vehicular collaborative driving demonstrates clear advantages over single-vehicle autonomy, traditional infrastructure-based V2X systems remain constrained by substantial deployment costs and the creation of "uncovered danger zones" in rural and suburban areas. We present AirV2X-Perception, a large-scale dataset that leverages Unmanned Aerial Vehicles (UAVs) as a flexible alternative or complement to fixed Road-Side Units (RSUs). Drones offer unique advantages over ground-based perception: complementary bird's-eye-views that reduce occlusions, dynamic positioning capabilities that enable hovering, patrolling, and escorting navigation rules, and significantly lower deployment costs compared to fixed infrastructure. Our dataset comprises 6.73 hours of drone-assisted driving scenarios across urban, suburban, and rural environments with varied weather and lighting conditions. The AirV2X-Perception dataset facilitates the development and standardized evaluation of Vehicle-to-Drone (V2D) algorithms, addressing a critical gap in the rapidly expanding field of aerial-assisted autonomous driving systems. The dataset and development kits are open-sourced at this https URL.
https://arxiv.org/abs/2506.19283
Automated detection of small and rare wildlife in aerial imagery is crucial for effective conservation, yet remains a significant technical challenge. Prairie dogs exemplify this issue: their ecological importance as keystone species contrasts sharply with their elusive presence--marked by small size, sparse distribution, and subtle visual features--which undermines existing detection approaches. To address these challenges, we propose RareSpot, a robust detection framework integrating multi-scale consistency learning and context-aware augmentation. Our multi-scale consistency approach leverages structured alignment across feature pyramids, enhancing fine-grained object representation and mitigating scale-related feature loss. Complementarily, context-aware augmentation strategically synthesizes challenging training instances by embedding difficult-to-detect samples into realistic environmental contexts, significantly boosting model precision and recall. Evaluated on an expert-annotated prairie dog drone imagery benchmark, our method achieves state-of-the-art performance, improving detection accuracy by over 35% compared to baseline methods. Importantly, it generalizes effectively across additional wildlife datasets, demonstrating broad applicability. The RareSpot benchmark and approach not only support critical ecological monitoring but also establish a new foundation for detecting small, rare species in complex aerial scenes.
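The abstract names multi-scale consistency learning without detail; the toy loss below only illustrates the general idea, penalising disagreement between features of a full-resolution image and an upsampled low-resolution pass. It is not the paper's structured alignment across a feature pyramid.

```python
import torch
import torch.nn.functional as F

def multiscale_consistency_loss(backbone, images, scale=0.5):
    """Toy multi-scale consistency term.

    backbone : any module mapping (B, 3, H, W) -> (B, C, h, w) feature maps.
    Features of a downsampled image, upsampled back to the full-resolution
    feature size, should agree with features of the original image.
    """
    feats_hi = backbone(images)
    small = F.interpolate(images, scale_factor=scale, mode="bilinear", align_corners=False)
    feats_lo = backbone(small)
    feats_lo = F.interpolate(feats_lo, size=feats_hi.shape[-2:], mode="bilinear", align_corners=False)
    return F.mse_loss(feats_lo, feats_hi)

backbone = torch.nn.Conv2d(3, 16, 3, stride=2, padding=1)   # stand-in feature extractor
loss = multiscale_consistency_loss(backbone, torch.randn(2, 3, 128, 128))
loss.backward()
```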
https://arxiv.org/abs/2506.19087
Learning-based control approaches like reinforcement learning (RL) have recently produced a slew of impressive results for tasks like quadrotor trajectory tracking and drone racing. Naturally, it is common to demonstrate the advantages of these new controllers against established methods like analytical controllers. We observe, however, that reliably comparing the performance of such very different classes of controllers is more complicated than might appear at first sight. As a case study, we take up the problem of agile tracking of an end-effector for a quadrotor with a fixed arm. We develop a set of best practices for synthesizing the best-in-class RL and geometric controllers (GC) for benchmarking. In the process, we resolve widespread RL-favoring biases in prior studies that provide asymmetric access to: (1) the task definition, in the form of an objective function, (2) representative datasets, for parameter optimization, and (3) feedforward information, describing the desired future trajectory. The resulting findings are the following: our improvements to the experimental protocol for comparing learned and classical controllers are critical, and each of the above asymmetries can yield misleading conclusions. Prior works have claimed that RL outperforms GC, but we find the gaps between the two controller classes are much smaller than previously published when accounting for symmetric comparisons. Geometric control achieves lower steady-state error than RL, while RL has better transient performance, resulting in GC performing better in relatively slow or less agile tasks, but RL performing better when greater agility is required. Finally, we open-source implementations of geometric and RL controllers for these aerial vehicles, implementing best practices for future development. Website and code are available at this https URL
https://arxiv.org/abs/2506.17832
Unsupervised Domain Adaptation (UDA) is a critical challenge in real-world vision systems, especially in resource-constrained environments like drones, where memory and computation are limited. Existing prompt-driven UDA methods typically rely on large vision-language models and require full access to source-domain data during adaptation, limiting their applicability. In this work, we propose Prmpt2Adpt, a lightweight and efficient zero-shot domain adaptation framework built around a teacher-student paradigm guided by prompt-based feature alignment. At the core of our method is a distilled and fine-tuned CLIP model, used as the frozen backbone of a Faster R-CNN teacher. A small set of low-level source features is aligned to the target domain semantics, specified only through a natural language prompt, via Prompt-driven Instance Normalization (PIN). These semantically steered features are used to briefly fine-tune the detection head of the teacher model. The adapted teacher then generates high-quality pseudo-labels, which guide the on-the-fly adaptation of a compact student model. Experiments on the MDS-A dataset demonstrate that Prmpt2Adpt achieves competitive detection performance compared to state-of-the-art methods, while delivering up to 7x faster adaptation and 5x faster inference speed using only a few source images, making it a practical and scalable solution for real-time adaptation in low-resource domains.
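Prompt-driven Instance Normalization is sketched below in an AdaIN-like form: instance-normalise source features, then re-scale and shift them with statistics predicted from a text embedding of the target-domain prompt. The two linear heads and the dimensions are assumptions, not the paper's parameterisation.

```python
import torch
import torch.nn as nn

class PromptInstanceNorm(nn.Module):
    """AdaIN-style sketch of prompt-driven feature restyling (illustrative only)."""
    def __init__(self, channels, text_dim=512):
        super().__init__()
        self.to_gamma = nn.Linear(text_dim, channels)   # hypothetical scale head
        self.to_beta = nn.Linear(text_dim, channels)    # hypothetical shift head

    def forward(self, feat, text_emb):
        mu = feat.mean(dim=(2, 3), keepdim=True)
        sigma = feat.std(dim=(2, 3), keepdim=True) + 1e-5
        normed = (feat - mu) / sigma                    # instance-normalised source features
        gamma = self.to_gamma(text_emb).unsqueeze(-1).unsqueeze(-1)
        beta = self.to_beta(text_emb).unsqueeze(-1).unsqueeze(-1)
        return (1 + gamma) * normed + beta              # restyled towards the prompted domain

pin = PromptInstanceNorm(channels=64)
styled = pin(torch.randn(2, 64, 32, 32), torch.randn(2, 512))  # text_emb: e.g. a CLIP text feature
print(styled.shape)
```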
https://arxiv.org/abs/2506.16994
Unmanned aerial vehicle (UAV) object detection plays a vital role in applications such as environmental monitoring and urban security. To improve robustness, recent studies have explored multimodal detection by fusing visible (RGB) and infrared (IR) imagery. However, due to UAV platform motion and asynchronous imaging, spatial misalignment frequently occurs between modalities, leading to weak alignment. This introduces two major challenges: semantic inconsistency at corresponding spatial locations and modality conflict during feature fusion. Existing methods often address these issues in isolation, limiting their effectiveness. In this paper, we propose Cross-modal Offset-guided Dynamic Alignment and Fusion (CoDAF), a unified framework that jointly tackles both challenges in weakly aligned UAV-based object detection. CoDAF comprises two novel modules: the Offset-guided Semantic Alignment (OSA), which estimates attention-based spatial offsets and uses deformable convolution guided by a shared semantic space to align features more precisely; and the Dynamic Attention-guided Fusion Module (DAFM), which adaptively balances modality contributions through gating and refines fused features via spatial-channel dual attention. By integrating alignment and fusion in a unified design, CoDAF enables robust UAV object detection. Experiments on standard benchmarks validate the effectiveness of our approach, with CoDAF achieving a mAP of 78.6% on the DroneVehicle dataset.
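A minimal gated RGB/IR fusion sketch in the spirit of the adaptive modality balancing the abstract describes; the paper's DAFM additionally applies spatial-channel dual attention, and everything below is illustrative.

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Fuse RGB and IR feature maps with a per-location sigmoid gate."""
    def __init__(self, channels):
        super().__init__()
        self.gate = nn.Sequential(nn.Conv2d(2 * channels, channels, 1), nn.Sigmoid())

    def forward(self, rgb_feat, ir_feat):
        g = self.gate(torch.cat([rgb_feat, ir_feat], dim=1))  # modality weight in [0, 1]
        return g * rgb_feat + (1 - g) * ir_feat

fused = GatedFusion(64)(torch.randn(1, 64, 80, 80), torch.randn(1, 64, 80, 80))
print(fused.shape)   # torch.Size([1, 64, 80, 80])
```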
https://arxiv.org/abs/2506.16737