Real-time tracking of small unmanned aerial vehicles (UAVs) on edge devices faces a fundamental resolution-speed conflict. Downsampling high-resolution imagery to standard detector input sizes causes small target features to collapse below detectable thresholds, yet processing native 1080p frames on resource-constrained platforms yields insufficient throughput for smooth gimbal control. We propose SDG-Track, a Sparse Detection-Guided Tracker that adopts an Observer-Follower architecture to reconcile this conflict. The Observer stream runs a high-capacity detector at low frequency on the GPU to provide accurate position anchors from 1920x1080 frames. The Follower stream performs high-frequency trajectory interpolation via ROI-constrained sparse optical flow on the CPU. To handle tracking failures from occlusion or model drift caused by spectrally similar distractors, we introduce Dual-Space Recovery, a training-free re-acquisition mechanism combining color histogram matching with geometric consistency constraints. Experiments on a ground-to-air tracking station demonstrate that SDG-Track achieves 35.1 FPS system throughput while retaining 97.2% of the frame-by-frame detection precision. The system successfully tracks agile FPV drones under real-world operational conditions on an NVIDIA Jetson Orin Nano. Our code is publicly available at this https URL.
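A minimal sketch of the Follower stream's ROI-constrained sparse optical flow step, assuming OpenCV. The Observer detector that supplies the `roi` anchor is not shown, and the corner budget, re-seed threshold, and LK window sizes are illustrative assumptions, not the paper's settings:

```python
import cv2
import numpy as np

def follower_step(prev_gray, gray, roi, points):
    """Propagate sparse keypoints inside the ROI with pyramidal LK flow."""
    x, y, w, h = roi
    if points is None or len(points) < 8:
        # Re-seed corners inside the current ROI when tracks run low.
        mask = np.zeros_like(prev_gray)
        mask[y:y + h, x:x + w] = 255
        points = cv2.goodFeaturesToTrack(prev_gray, maxCorners=32,
                                         qualityLevel=0.01, minDistance=3,
                                         mask=mask)
    nxt, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, gray, points, None,
                                              winSize=(15, 15), maxLevel=2)
    ok = status.ravel() == 1
    good = nxt[ok]
    # Median of the surviving flow vectors: a robust inter-frame box shift.
    shift = np.median(good - points[ok], axis=0).ravel()
    return good.reshape(-1, 1, 2), (x + int(shift[0]), y + int(shift[1]), w, h)
```

Between low-frequency detector anchors, repeating this step per frame interpolates the target box at CPU-only cost, which is the division of labor the Observer-Follower design describes.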
https://arxiv.org/abs/2512.04883
Drones are becoming indispensable in many application domains. In data-driven missions, besides sensing, the drone must process the collected data at runtime to decide whether additional action must be taken on the spot, before moving to the next point of interest. If processing does not reveal an event or situation that requires such an action, the drone has waited in vain instead of moving on. If, however, the drone starts moving to the next point and it turns out that a follow-up action is needed at the previous point, it must spend time flying back. To take this decision, we propose different machine-learning methods based on branch prediction and reinforcement learning. We evaluate these methods for a wide range of scenarios where the probability of event occurrence changes with time. Our results show that the proposed methods consistently outperform the regression-based method proposed in the literature and can significantly improve the worst-case mission time, by up to 4.1x. Also, the achieved median mission time is very close to that of a method with perfect knowledge of the current underlying event probability at each point of interest, exceeding it by at most 2.7%.
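The abstract does not specify which branch-prediction variant is used; as one plausible flavor of the idea, a classic two-bit saturating counter can drive the wait-or-move decision at each point of interest:

```python
class TwoBitPredictor:
    """Classic two-bit saturating counter: states 0-1 predict 'no event',
    states 2-3 predict 'event'; each observed outcome nudges the state."""

    def __init__(self):
        self.state = 1  # start weakly biased toward 'no event'

    def predict(self):
        return self.state >= 2  # True -> wait at the point of interest

    def update(self, event_occurred):
        if event_occurred:
            self.state = min(self.state + 1, 3)
        else:
            self.state = max(self.state - 1, 0)
```

The drone waits when `predict()` returns True; a misprediction in either direction costs either idle waiting or a fly-back, which is exactly the trade-off the learned methods optimize.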
https://arxiv.org/abs/2512.04773
Traditional high-quality 3D scanning and reconstruction typically relies on human labor to plan the scanning procedure. With the rapid development of embodied systems such as drones and robots, there is a growing demand for performing accurate 3D scanning and reconstruction in a fully automated manner. We introduce Auto3R, a data-driven uncertainty quantification model designed to automate the 3D scanning and reconstruction of scenes and objects, including objects with non-Lambertian and specular materials. Specifically, in a process of iterative 3D reconstruction and scanning, Auto3R makes efficient and accurate predictions of the uncertainty distribution over potential scanning viewpoints, without knowing the ground-truth geometry and appearance. Through extensive experiments, we show that Auto3R outperforms the state-of-the-art methods by a large margin. We also deploy Auto3R on a robot arm equipped with a camera and demonstrate that it can effectively digitize real-world 3D objects and deliver ready-to-use, photorealistic digital assets. Our homepage: this https URL.
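Auto3R's uncertainty predictor is learned; only the greedy next-best-view loop around it is sketched below, with a toy distance-based score standing in for the learned model (`score_view` is a hypothetical stand-in, not the paper's API):

```python
import math

def auto_scan(score_view, candidates, budget):
    """Greedy next-best-view loop: repeatedly scan the candidate view with
    the highest predicted uncertainty until the view budget is spent."""
    scanned, remaining = [], list(candidates)
    for _ in range(budget):
        best = max(remaining, key=lambda v: score_view(v, scanned))
        remaining.remove(best)
        scanned.append(best)
    return scanned

# Toy usage: "uncertainty" favors views far from everything scanned so far.
views = [(math.cos(t), math.sin(t)) for t in range(6)]
score = lambda v, seen: min((math.dist(v, s) for s in seen), default=1.0)
print(auto_scan(score, views, budget=3))
```

In the real system, the reconstruction is refit after each scan and the predictor scores viewpoints against it, so the loop alternates reconstruction and view selection rather than scoring once.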
https://arxiv.org/abs/2512.04528
Cross-view geo-localization infers a location by retrieving geo-tagged reference images that visually correspond to a query image. However, the traditional satellite-centric paradigm limits robustness when high-resolution or up-to-date satellite imagery is unavailable. It further underexploits complementary cues across views (e.g., drone, satellite, and street) and modalities (e.g., language and image). To address these challenges, we propose GeoBridge, a foundation model that performs bidirectional matching across views and supports language-to-image retrieval. Going beyond traditional satellite-centric formulations, GeoBridge builds on a novel semantic-anchor mechanism that bridges multi-view features through textual descriptions for robust, flexible localization. In support of this task, we construct GeoLoc, the first large-scale, cross-modal, and multi-view aligned dataset comprising over 50,000 pairs of drone, street-view panorama, and satellite images as well as their textual descriptions, collected from 36 countries, ensuring both geographic and semantic alignment. We perform broad evaluations across multiple tasks. Experiments confirm that GeoLoc pre-training markedly improves geo-location accuracy for GeoBridge while promoting cross-domain generalization and cross-modal knowledge transfer. The dataset, source code, and pretrained models are released at this https URL.
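Once drone, street, satellite, and text inputs are embedded into the shared semantic-anchor space, localization reduces to nearest-neighbor retrieval; a sketch using cosine similarity over L2-normalized embeddings (the embedding networks themselves are assumed, not shown):

```python
import numpy as np

def retrieve(query_emb, ref_embs, k=5):
    """Cosine-similarity retrieval: the same routine serves drone->satellite,
    street->drone, or text->image queries once every view is embedded in
    the shared anchor space."""
    q = query_emb / np.linalg.norm(query_emb)
    r = ref_embs / np.linalg.norm(ref_embs, axis=1, keepdims=True)
    sims = r @ q
    return np.argsort(sims)[::-1][:k]  # indices of the top-k references

rng = np.random.default_rng(0)
print(retrieve(rng.standard_normal(512), rng.standard_normal((1000, 512))))
```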
https://arxiv.org/abs/2512.02697
We propose a methodology for falsifying safety properties in robotic vehicle systems through property-guided reduction and surrogate execution. By isolating only the control logic and physical dynamics relevant to a given specification, we construct lightweight surrogate models that preserve property-relevant behaviors while eliminating unrelated system complexity. This enables scalable falsification via trace analysis and temporal logic oracles. We demonstrate the approach on a drone control system containing a known safety flaw. The surrogate replicates failure conditions at a fraction of the simulation cost, and a property-guided fuzzer efficiently discovers semantic violations. Our results suggest that controller reduction, when coupled with logic-aware test generation, provides a practical and scalable path toward semantic verification of cyber-physical systems.
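A toy illustration of the pipeline's shape, with a hypothetical one-dimensional altitude surrogate in place of the property-reduced drone model and a bounded "globally" operator as the temporal-logic oracle:

```python
import random

def always(trace, predicate):
    """Bounded 'globally' operator: the property must hold in every state."""
    return all(predicate(s) for s in trace)

def surrogate(gain, steps=200, dt=0.05):
    """Hypothetical 1-D altitude surrogate: a P-controller toward 10 m,
    standing in for the reduced controller plus physical dynamics."""
    alt, vel = 5.0, 0.0
    trace = []
    for _ in range(steps):
        vel += dt * (gain * (10.0 - alt) - 0.1)  # control accel minus bias
        alt += dt * vel
        trace.append(alt)
    return trace

# Property-guided fuzzing: sample controller gains and flag any trace that
# violates the safety property "altitude stays positive".
random.seed(0)
for _ in range(100):
    g = random.uniform(-2.0, 2.0)
    if not always(surrogate(g), lambda a: a > 0.0):
        print("violation found with gain", g)
        break
```

The point of the methodology is that this surrogate runs orders of magnitude faster than full simulation while still exposing the property-relevant failure.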
https://arxiv.org/abs/2512.02270
The field of 360-degree omnidirectional understanding has been receiving increasing attention for advancing spatial intelligence. However, the lack of large-scale and diverse data remains a major limitation. In this work, we propose AirSim360, a simulation platform for omnidirectional data from aerial viewpoints, enabling wide-ranging scene sampling with drones. Specifically, AirSim360 focuses on three key aspects: a render-aligned data and labeling paradigm for pixel-level geometric, semantic, and entity-level understanding; an interactive pedestrian-aware system for modeling human behavior; and an automated trajectory generation paradigm to support navigation tasks. Furthermore, we collect more than 60K panoramic samples and conduct extensive experiments across various tasks to demonstrate the effectiveness of our simulator. Unlike existing simulators, our work is the first to systematically model the 4D real world under an omnidirectional setting. The entire platform, including the toolkit, plugins, and collected datasets, will be made publicly available at this https URL.
https://arxiv.org/abs/2512.02009
Tiny object detection in remote sensing imagery has attracted significant research interest in recent years. Despite recent progress, achieving balanced detection performance across diverse object scales remains a formidable challenge, particularly in scenarios where dense tiny objects and large objects coexist. Although large foundation models have revolutionized general vision tasks, their application to tiny object detection remains unexplored due to the extreme scale variation and density distribution inherent to remote sensing imagery. To bridge this scale gap, we propose ScaleBridge-Det, to the best of our knowledge the first large detection framework designed for tiny objects, which achieves balanced performance across diverse scales through scale-adaptive expert routing and density-guided query allocation. Specifically, we introduce a Routing-Enhanced Mixture Attention (REM) module that dynamically selects and fuses scale-specific expert features via adaptive routing to address the tendency of standard MoE models to favor dominant scales. REM generates complementary and discriminative multi-scale representations suitable for both tiny and large objects. Furthermore, we present a Density-Guided Dynamic Query (DGQ) module that predicts object density to adaptively adjust query positions and numbers, enabling efficient resource allocation for objects of varying scales. The proposed framework allows ScaleBridge-Det to simultaneously optimize performance for both dense tiny and general objects without trade-offs. Extensive experiments on benchmark and cross-domain datasets demonstrate that ScaleBridge-Det achieves state-of-the-art performance on AI-TOD-V2 and DTOD, while exhibiting superior cross-domain robustness on VisDrone.
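The routing idea can be sketched as a softmax gate over scale-specific experts; the router weights below are random stand-ins for REM's learned routing, and the toy experts merely emulate different receptive fields:

```python
import numpy as np

def rem_fuse(token, experts):
    """Soft routing over scale-specific experts: gate each expert's output
    by a softmax over router logits, then sum the gated features."""
    rng = np.random.default_rng(0)
    router = rng.standard_normal((len(experts), token.size))  # placeholder weights
    gates = np.exp(router @ token)
    gates /= gates.sum()                         # softmax over experts
    feats = np.stack([e(token) for e in experts])
    return (gates[:, None] * feats).sum(axis=0)  # gated mixture of experts

# Toy experts standing in for small/medium/large-scale feature branches.
experts = [lambda t: t, lambda t: 0.5 * t, lambda t: t ** 2]
print(rem_fuse(np.ones(4), experts))
```

The failure mode REM targets is visible in this structure: if the gates collapse onto one dominant-scale expert, the other scales contribute nothing, so the adaptive routing is trained to keep the mixture complementary.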
https://arxiv.org/abs/2512.01665
Autonomous aerial tracking with drones offers vast potential for surveillance, cinematography, and industrial inspection applications. While single-drone tracking systems have been extensively studied, swarm-based target tracking remains underexplored, despite its unique advantages of distributed perception, fault-tolerant redundancy, and multidirectional target coverage. To bridge this gap, we propose a novel decentralized LiDAR-based swarm tracking framework that enables visibility-aware, cooperative target tracking in complex environments, while fully harnessing the unique capabilities of swarm systems. To address visibility, we introduce a novel Spherical Signed Distance Field (SSDF)-based metric for 3-D environmental occlusion representation, coupled with an efficient algorithm that enables real-time onboard SSDF updating. A general Field-of-View (FOV) alignment cost supporting heterogeneous LiDAR configurations is proposed for consistent target observation. Swarm coordination is enhanced through cooperative costs that enforce inter-robot safe clearance, prevent mutual occlusions, and notably facilitate 3-D multidirectional target encirclement via a novel electrostatic-potential-inspired distribution metric. These innovations are integrated into a hierarchical planner, combining a kinodynamic front-end searcher with a spatiotemporal $SE(3)$ back-end optimizer to generate collision-free, visibility-optimized trajectories. Deployed on heterogeneous LiDAR swarms, our fully decentralized implementation features collaborative perception, distributed planning, and dynamic swarm reconfigurability. Validated through rigorous real-world experiments in cluttered outdoor environments, the proposed system demonstrates robust cooperative tracking of agile targets (drones, humans) while achieving superior visibility maintenance.
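The electrostatic-potential-inspired distribution metric is not spelled out in the abstract; a Thomson-problem-style analogue, in which clustered viewing directions incur high cost, conveys the idea:

```python
import numpy as np

def encirclement_cost(drone_positions, target):
    """Electrostatic-style repulsion between viewing directions: treat each
    drone's bearing to the target as a charge on the unit sphere; summed
    inverse pairwise distances are minimized by a spread-out, multi-
    directional formation."""
    dirs = drone_positions - target
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
    cost = 0.0
    for i in range(len(dirs)):
        for j in range(i + 1, len(dirs)):
            cost += 1.0 / (np.linalg.norm(dirs[i] - dirs[j]) + 1e-6)
    return cost

# The nearly collinear pair dominates the cost, pushing the planner
# to redistribute drones around the target.
drones = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.9, 0.1, 0.0]])
print(encirclement_cost(drones, np.zeros(3)))
```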
https://arxiv.org/abs/2512.01280
Fast, efficient, robust communication during wildfire and other emergency responses is critical. One way to achieve this is by coordinating swarms of autonomous aerial vehicles carrying communications equipment to form an ad-hoc network connecting emergency response personnel to both each other and central command. However, operating in such extreme environments may lead to individual networking agents being damaged or rendered inoperable, which could bring down the network and interrupt communications. To overcome this challenge and enable multi-agent UAV networking in difficult environments, this paper introduces and formalizes the problem of Robust Task Networking Under Attrition (RTNUA), which extends connectivity maintenance in multi-robot systems to explicitly address proactive redundancy and attrition recovery. We introduce Physics-Informed Robust Employment of Multi-Agent Networks ($\Phi$IREMAN), a topological algorithm leveraging physics-inspired potential fields to solve this problem. Through simulation across 25 problem configurations, $\Phi$IREMAN consistently outperforms the DCCRS baseline, and on large-scale problems with up to 100 tasks and 500 drones, maintains $>99.9\%$ task uptime despite substantial attrition, demonstrating both effectiveness and scalability.
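$\Phi$IREMAN's potential fields are defined over the network topology; a generic physics-inspired sketch of a per-drone force balance (attraction toward an assigned network anchor, short-range repulsion from neighbors) illustrates the mechanism, with all gains and radii assumed for illustration:

```python
import numpy as np

def net_force(pos, neighbors, anchor, k_att=1.0, k_rep=4.0, d0=2.0):
    """Potential-field update for one relay drone: spring attraction toward
    its assigned anchor plus inverse-square repulsion from close neighbors,
    so surviving drones re-spread to cover gaps after attrition."""
    force = k_att * (anchor - pos)
    for n in neighbors:
        d = np.linalg.norm(pos - n)
        if d < d0:  # repel only inside the clearance radius
            force += k_rep * (pos - n) / (d ** 2 + 1e-6)
    return force

pos = np.array([0.0, 0.0])
print(net_force(pos, [np.array([0.5, 0.0])], np.array([3.0, 4.0])))
```

Because the field re-equilibrates whenever an agent disappears, redundancy and attrition recovery fall out of the same update rule rather than requiring a separate repair procedure.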
https://arxiv.org/abs/2512.02079
This paper presents the development of a control law intended to be implemented on an optically guided glider. The guidance law follows an innovative approach: reinforcement learning. This control law is used to make navigation more flexible and autonomous in a dynamic environment. The final objective is to track a target detected with the camera and then guide the glider to this point with high precision. Reinforcement learning has already been applied to quadcopter drones; with this study, we wish to demonstrate its applicability to fixed-wing aircraft on all of its axes.
https://arxiv.org/abs/2512.01066
World engines aim to synthesize long, 3D-consistent videos that support interactive exploration of a scene under user-controlled camera motion. However, existing systems struggle under aggressive 6-DoF trajectories and complex outdoor layouts: they lose long-range geometric coherence, deviate from the target path, or collapse into overly conservative motion. To this end, we introduce Captain Safari, a pose-conditioned world engine that generates videos by retrieving from a persistent world memory. Given a camera path, our method maintains a dynamic local memory and uses a retriever to fetch pose-aligned world tokens, which then condition video generation along the trajectory. This design enables the model to maintain stable 3D structure while accurately executing challenging camera maneuvers. To evaluate this setting, we curate OpenSafari, a new in-the-wild FPV dataset containing high-dynamic drone videos with verified camera trajectories, constructed through a multi-stage geometric and kinematic validation pipeline. Across video quality, 3D consistency, and trajectory following, Captain Safari substantially outperforms state-of-the-art camera-controlled generators. It reduces MEt3R from 0.3703 to 0.3690, improves AUC@30 from 0.181 to 0.200, and yields substantially lower FVD than all camera-controlled baselines. More importantly, in a 50-participant, 5-way human study where annotators select the best result among five anonymized models, 67.6% of preferences favor our method across all axes. Our results demonstrate that pose-conditioned world memory is a powerful mechanism for long-horizon, controllable video generation and provide OpenSafari as a challenging new benchmark for future world-engine research.
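The retriever in the paper is learned and operates on world tokens; the pose-conditioned lookup pattern itself can be sketched with a nearest-pose memory (string payloads stand in for token tensors):

```python
import numpy as np

class WorldMemory:
    """Minimal pose-keyed memory: store (pose, tokens) pairs and fetch the
    k entries whose camera poses are nearest to the queried pose."""

    def __init__(self):
        self.poses, self.tokens = [], []

    def write(self, pose, toks):
        self.poses.append(np.asarray(pose, dtype=float))
        self.tokens.append(toks)

    def read(self, pose, k=2):
        dists = [np.linalg.norm(p - pose) for p in self.poses]
        idx = np.argsort(dists)[:k]
        return [self.tokens[i] for i in idx]

mem = WorldMemory()
mem.write([0, 0, 0], "tokens@origin")
mem.write([5, 0, 0], "tokens@x5")
print(mem.read(np.array([4.0, 0.0, 0.0]), k=1))  # -> ['tokens@x5']
```

Conditioning generation on whatever the memory returns for the current camera pose is what lets the model revisit previously generated geometry consistently along aggressive 6-DoF paths.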
https://arxiv.org/abs/2511.22815
Quantifying the uncertainty of an object's pose estimate is essential for robust control and planning. Although pose estimation is a well-studied robotics problem, attaching statistically rigorous uncertainty is not well understood without strict distributional assumptions. We develop distribution-free pose uncertainty bounds about a given pose estimate in the monocular setting. Our pose uncertainty only requires high probability noise bounds on pixel detections of 2D semantic keypoints on a known object. This noise model induces an implicit, non-convex set of pose uncertainty constraints. Our key contribution is SLUE (S-Lemma Uncertainty Estimation), a convex program to reduce this set to a single ellipsoidal uncertainty bound that is guaranteed to contain the true object pose with high probability. SLUE solves a relaxation of the minimum volume bounding ellipsoid problem inspired by the celebrated S-lemma. It requires no initial guess of the bound's shape or size and is guaranteed to contain the true object pose with high probability. For tighter uncertainty bounds at the same confidence, we extend SLUE to a sum-of-squares relaxation hierarchy which is guaranteed to converge to the minimum volume ellipsoidal uncertainty bound for a given set of keypoint constraints. We show this pose uncertainty bound can easily be projected to independent translation and axis-angle orientation bounds. We evaluate SLUE on two pose estimation datasets and a real-world drone tracking scenario. Compared to prior work, SLUE generates substantially smaller translation bounds and competitive orientation bounds. We release code at this https URL.
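SLUE itself solves an S-lemma relaxation over the implicit keypoint-noise constraints, but the classical minimum-volume (Löwner-John) enclosing ellipsoid it relates to can be written directly in cvxpy, assuming cvxpy with an exponential-cone-capable solver such as SCS; the sample points here are hypothetical pose hypotheses:

```python
import cvxpy as cp
import numpy as np

# Sample translation hypotheses consistent with keypoint noise (toy data).
pts = np.random.default_rng(1).standard_normal((40, 3))

A = cp.Variable((3, 3), PSD=True)  # ellipsoid {x : ||A x + b|| <= 1}
b = cp.Variable(3)
cons = [cp.norm(A @ p + b) <= 1 for p in pts]
# Minimizing -log det(A) maximizes det(A), i.e. minimizes ellipsoid volume.
prob = cp.Problem(cp.Minimize(-cp.log_det(A)), cons)
prob.solve()
print("ellipsoid center:", np.linalg.solve(A.value, -b.value))
```

SLUE's contribution is handling the non-convex constraint set induced by the noise model directly, without sampling pose hypotheses as done above, while keeping the high-probability containment guarantee.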
https://arxiv.org/abs/2511.21666
Unmanned Aerial Vehicles (UAVs) are crucial in Search and Rescue (SAR) missions due to their ability to monitor vast maritime areas. However, small objects often remain difficult to detect from high altitudes due to low object-to-background pixel ratios. We propose an altitude-aware dynamic tiling method that scales and adaptively subdivides the image into tiles for enhanced small object detection. By integrating altitude-dependent scaling with an adaptive tiling factor, we reduce unnecessary computation while maintaining detection performance. Tested on the SeaDronesSee dataset [1] with YOLOv5 [2] and Slicing Aided Hyper Inference (SAHI) framework [3], our approach improves Mean Average Precision (mAP) for small objects by 38% compared to a baseline and achieves more than double the inference speed compared to static tiling. This approach enables more efficient and accurate UAV-based SAR operations under diverse conditions.
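A sketch of the altitude-dependent tiling logic; the ground-sample distance, object size, and minimum-pixel constants are all assumed for illustration, and ragged edge tiles are omitted for brevity:

```python
def tile_boxes(img_w, img_h, altitude_m, det_in=640, gsd_at_100m=0.05,
               obj_size_m=1.5, min_obj_px=12, overlap=0.2):
    """Altitude-aware tiling: pick a tile size so a ~1.5 m object keeps at
    least min_obj_px pixels after the tile is resized to the detector
    input, then cover the frame with overlapping tiles of that size."""
    gsd = gsd_at_100m * altitude_m / 100.0   # meters per pixel at altitude
    obj_px = obj_size_m / gsd                # object extent at full resolution
    # Largest tile that still leaves the object >= min_obj_px after resize.
    tile = int(min(img_w, obj_px * det_in / min_obj_px))
    stride = max(1, int(tile * (1 - overlap)))
    boxes = []
    for y in range(0, max(img_h - tile, 0) + 1, stride):
        for x in range(0, max(img_w - tile, 0) + 1, stride):
            boxes.append((x, y, x + tile, y + tile))
    return boxes

print(len(tile_boxes(3840, 2160, altitude_m=80)))   # few tiles when low
print(len(tile_boxes(3840, 2160, altitude_m=200)))  # many tiles when high
```

Higher altitude shrinks the ground-sample size of each object, so the tile size shrinks and the tile count grows; at low altitude the grid degenerates toward a single tile, which is where the computation savings over static tiling come from.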
https://arxiv.org/abs/2511.19728
Small object detection in Unmanned Aerial Vehicle (UAV) imagery is a persistent challenge, hindered by low resolution and background clutter. While fusing RGB and infrared (IR) data offers a promising solution, existing methods often struggle with the trade-off between effective cross-modal interaction and computational efficiency. In this letter, we introduce MambaRefine-YOLO. Its core contributions are a Dual-Gated Complementary Mamba fusion module (DGC-MFM) that adaptively balances RGB and IR modalities through illumination-aware and difference-aware gating mechanisms, and a Hierarchical Feature Aggregation Neck (HFAN) that uses a "refine-then-fuse" strategy to enhance multi-scale features. Our comprehensive experiments validate this dual-pronged approach. On the dual-modality DroneVehicle dataset, the full model achieves a state-of-the-art mAP of 83.2%, an improvement of 7.9% over the baseline. On the single-modality VisDrone dataset, a variant using only the HFAN also shows significant gains, demonstrating its general applicability. Our work presents a superior balance between accuracy and speed, making it highly suitable for real-world UAV applications.
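In the paper the gates are learned; below is a hand-set sketch of the gating structure only, where a crude brightness proxy and cross-modal disagreement steer the blend between RGB and IR features:

```python
import numpy as np

def dual_gated_fuse(rgb_feat, ir_feat):
    """Illumination- and difference-aware gating, per feature map: dark
    scenes shift weight toward IR, and large cross-modal disagreement
    sharpens the gate instead of blending blindly."""
    illum = rgb_feat.mean()                   # brightness proxy in [0, 1]
    diff = np.abs(rgb_feat - ir_feat).mean()  # cross-modal disagreement
    g = 1.0 / (1.0 + np.exp(-(illum - 0.5) * (1.0 + diff)))  # sigmoid gate
    return g * rgb_feat + (1.0 - g) * ir_feat

rgb = np.full((8, 8), 0.2)  # dim RGB response
ir = np.full((8, 8), 0.9)   # strong IR response
print(dual_gated_fuse(rgb, ir).mean())  # leans toward IR in low light
```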
https://arxiv.org/abs/2511.19134
This paper addresses the challenge of coordinating multi-robot systems under realistic communication delays using distributed optimization. We focus on consensus ADMM as a scalable framework for generating collision-free, dynamically feasible motion plans in both trajectory optimization and receding-horizon control settings. In practice, however, these algorithms are sensitive to penalty tuning or adaptation schemes (e.g. residual balancing and adaptive parameter heuristics) that do not explicitly consider delays. To address this, we introduce a Delay-Aware ADMM (DA-ADMM) variant that adapts penalty parameters based on real-time delay statistics, allowing agents to down-weight stale information and prioritize recent updates during consensus and dual updates. Through extensive simulations in 2D and 3D environments with double-integrator, Dubins-car, and drone dynamics, we show that DA-ADMM significantly improves robustness, success rate, and solution quality compared to fixed-parameter, residual-balancing, and fixed-constraint baselines. Our results highlight that performance degradation is not solely determined by delay length or frequency, but by the optimizer's ability to contextually reason over delayed information. The proposed DA-ADMM achieves consistently better coordination performance across a wide range of delay conditions, offering a principled and efficient mechanism for resilient multi-robot motion planning under imperfect communication.
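One plausible instantiation of the delay-aware penalty adaptation, down-weighting each neighbor's consensus terms exponentially in its observed message delay; the exponential form and decay constant are assumptions for illustration, not the paper's exact rule:

```python
import numpy as np

def delay_aware_weights(delays_ms, rho0=1.0, tau_ms=50.0):
    """Scale each neighbor's ADMM penalty parameter by an exponential decay
    in its observed message delay, so stale consensus information is
    down-weighted during the primal and dual updates."""
    delays = np.asarray(delays_ms, dtype=float)
    return rho0 * np.exp(-delays / tau_ms)

# Fresh links keep nearly full weight; a 400 ms-stale link barely counts.
print(delay_aware_weights([0, 25, 100, 400]))
```

Each agent then solves its local step with per-neighbor penalties, e.g. minimizing f_i(x) + sum_j (rho_j / 2) ||x - z_j||^2, so a stale neighbor's consensus variable z_j pulls on the solution far less than a fresh one.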
https://arxiv.org/abs/2511.18703
Accurate localization is essential for marine robotics, yet Global Navigation Satellite System (GNSS) signals are unreliable or unavailable even at a very short distance below the water surface. Traditional alternatives, such as inertial navigation, Doppler Velocity Loggers (DVL), SLAM, and acoustic methods, suffer from error accumulation, high computational demands, or infrastructure dependence. In this work, we present a scalable multi-drone GNSS-based tracking system for surface and near-surface marine robots. Our approach combines efficient visual detection, lightweight multi-object tracking, GNSS-based triangulation, and a confidence-weighted Extended Kalman Filter (EKF) to provide stable GNSS estimation in real time. We further introduce a cross-drone tracking ID alignment algorithm that enforces global consistency across views, enabling robust multi-robot tracking with redundant aerial coverage. We validate our system in diversified complex settings to show the scalability and robustness of the proposed algorithm.
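The measurement-fusion step can be sketched as a confidence-weighted average of per-drone position estimates; the full system wraps this in an EKF with a motion model and the cross-drone ID alignment described above:

```python
import numpy as np

def fuse_observations(estimates, confidences):
    """Confidence-weighted fusion of per-drone GNSS-triangulated positions
    for one tracked robot; low-confidence detections (small, occluded,
    or near the image border) contribute proportionally less."""
    w = np.asarray(confidences, dtype=float)
    w /= w.sum()
    pts = np.asarray(estimates, dtype=float)
    return (w[:, None] * pts).sum(axis=0)

# Three drones observe the same surface robot with varying confidence.
print(fuse_observations([[10.1, 4.9], [9.8, 5.2], [10.6, 4.4]],
                        [0.9, 0.8, 0.3]))
```

Redundant aerial coverage pays off exactly here: any single drone's noisy or dropped observation is absorbed by the weighted combination rather than breaking the track.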
https://arxiv.org/abs/2511.18694
Autonomous inspection in hazardous environments requires AI agents that can interpret high-level goals and execute precise control. A key capability for such agents is spatial grounding, for example when a drone must center a detected object in its camera view to enable reliable inspection. While large language models provide a natural interface for specifying goals, using them directly for visual control achieves only 58% success in this task. We envision that equipping agents with a world model as a tool would allow them to roll out candidate actions and perform better in spatially grounded settings, but conventional world models are data and compute intensive. To address this, we propose a task-specific latent dynamics model that learns state-specific action-induced shifts in a shared latent space using only goal-state supervision. The model leverages global action embeddings and complementary training losses to stabilize learning. In experiments, our approach achieves 71% success and generalizes to unseen images and instructions, highlighting the potential of compact, domain-specific latent dynamics models for spatial alignment in autonomous inspection.
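A compact sketch of such a model's forward pass: the predicted shift depends on both the current latent state and a global embedding of the discrete action. The weights below are random placeholders for parameters that would be trained with goal-state supervision:

```python
import numpy as np

rng = np.random.default_rng(0)
W = 0.1 * rng.standard_normal((16, 16 + 4))  # placeholder learned weights
action_table = rng.standard_normal((3, 4))   # global action embeddings

def predict_next_latent(z, action_id):
    """State-specific action-induced shift: the displacement in the shared
    latent space is a function of the current state z and a global
    embedding of the discrete action (e.g., yaw-left, yaw-right)."""
    e = action_table[action_id]
    return z + W @ np.concatenate([z, e])

z = rng.standard_normal(16)
print(np.linalg.norm(predict_next_latent(z, 0) - z))  # size of the shift
```

Rolling this forward for each candidate action and scoring the resulting latents against the goal state gives the agent its cheap look-ahead, without training a full generative world model.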
https://arxiv.org/abs/2511.18319
Three-dimensional rigid-body transforms, i.e. rotations and translations, are central to modern differentiable machine learning pipelines in robotics, vision, and simulation. However, numerically robust and mathematically correct implementations, particularly on SO(3), are error-prone due to issues such as axis conventions, normalizations, composition consistency, and subtle errors that only appear in edge cases. SciPy's scipy.spatial.transform module is a rigorously tested Python implementation. However, it historically only supported NumPy, limiting adoption in GPU-accelerated and autodiff-based workflows. We present a complete overhaul of SciPy's spatial transform functionality that makes it compatible with any array library implementing the Python array API, including JAX, PyTorch, and CuPy. The revised implementation preserves the established SciPy interface while enabling GPU/TPU execution, JIT compilation, vectorized batching, and differentiation via native autodiff of the chosen backend. We demonstrate how this foundation supports differentiable scientific computing through two case studies: (i) scalability of 3D transforms and rotations and (ii) a JAX drone simulation that leverages SciPy's Rotation for accurate integration of rotational dynamics. Our contributions have been merged into SciPy main and will ship in the next release, providing a framework-agnostic, production-grade basis for 3D spatial math in differentiable systems and ML.
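The preserved interface is the existing scipy.spatial.transform API; a NumPy-backed example that, per the abstract, runs unchanged on array API backends such as JAX or PyTorch after the overhaul ships:

```python
from scipy.spatial.transform import Rotation as R

# Compose a yaw with a roll, apply the result to a body-frame vector,
# and inspect the quaternion.
yaw = R.from_euler("z", 90, degrees=True)
roll = R.from_euler("x", 10, degrees=True)
combined = yaw * roll                    # rotation composition

print(combined.apply([1.0, 0.0, 0.0]))  # rotate a unit vector (~[0, 1, 0])
print(combined.as_quat())               # quaternion in (x, y, z, w) order
```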
https://arxiv.org/abs/2511.18157
Remote sensing imagery is widely used across various fields, yet real-time detection remains challenging due to the prevalence of small objects and the need to balance accuracy with efficiency. To address this, we propose DMG-YOLO, a lightweight real-time detector tailored for small object detection in remote sensing images. Specifically, we design a Dual-branch Feature Extraction (DFE) module in the backbone, which partitions feature maps into two parallel branches: one extracts local features via depthwise separable convolutions, and the other captures global context using a vision transformer with a gating mechanism. Additionally, a Multi-scale Feature Fusion (MFF) module with dilated convolutions enhances multi-scale integration while preserving fine details. In the neck, we introduce the Global and Local Aggregate Feature Pyramid Network (GLAFPN) to further boost small object detection through global-local feature fusion. Extensive experiments on the VisDrone2019 and NWPU VHR-10 datasets show that DMG-YOLO achieves competitive performance in terms of mAP, model size, and other key metrics.
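The local branch of DFE is named as depthwise separable convolution; a minimal PyTorch version of that building block (channel counts are illustrative, and the transformer branch and gating are not shown):

```python
import torch
import torch.nn as nn

class LocalBranch(nn.Module):
    """Depthwise separable convolution, as in DFE's local branch: a
    per-channel 3x3 spatial conv followed by a 1x1 pointwise mix, costing
    far fewer FLOPs and parameters than a dense 3x3 convolution."""

    def __init__(self, c_in, c_out):
        super().__init__()
        self.depthwise = nn.Conv2d(c_in, c_in, 3, padding=1, groups=c_in)
        self.pointwise = nn.Conv2d(c_in, c_out, 1)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

x = torch.randn(1, 64, 80, 80)
print(LocalBranch(64, 128)(x).shape)  # torch.Size([1, 128, 80, 80])
```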
https://arxiv.org/abs/2511.17147
The rapid emergence of airborne platforms and imaging sensors is enabling new forms of aerial surveillance due to their unprecedented advantages in scale, mobility, deployment, and covert observation capabilities. This paper provides a comprehensive overview of more than 150 papers from the last 10 years on human-centric aerial surveillance tasks from a computer vision and machine learning perspective. It aims to provide readers with an in-depth systematic review and technical analysis of the current state of aerial surveillance tasks using drones, UAVs, and other airborne platforms. The object of interest is humans, where human subjects are to be detected, identified, and re-identified. More specifically, for each of these tasks, we first identify unique challenges in performing them in an aerial setting compared to the popular ground-based setting, and subsequently compile and analyze the aerial datasets publicly available for each task. Most importantly, we delve deep into the approaches in the aerial surveillance literature with a focus on investigating how they presently address aerial challenges and techniques for improvement. We conclude the paper by discussing the gaps and open research questions to inform future research avenues.
https://arxiv.org/abs/2511.17674