Autonomous Unmanned Underwater Vehicles (UUVs) enable covert military and civilian operations in coastal areas without relying on support vessels or Global Navigation Satellite Systems (GNSS). Such operations are critical when surface access is not possible and stealthy navigation is required in restricted environments such as protected zones or dangerous areas under an access ban. GNSS-denied navigation is then essential to maintaining concealment, as surfacing could expose UUVs to detection. To ensure precise fleet positioning, a constellation of beacons deployed by aerial or surface drones establishes a synthetic landmark network that guides the fleet of UUVs along an optimized path from the continental shelf to the goal on the shore. These beacons, either submerged or floating, emit acoustic signals for UUV localisation and navigation. A hierarchical planner generates an adaptive route for the drones, executing primitive actions while continuously monitoring and replanning as needed to maintain trajectory accuracy.
https://arxiv.org/abs/2601.15802
Object-Goal Navigation (ObjectNav) requires an agent to autonomously explore an unknown environment and navigate toward target objects specified by a semantic label. While prior work has primarily studied zero-shot ObjectNav under 2D locomotion, extending it to aerial platforms with 3D locomotion capability remains underexplored. Aerial robots offer superior maneuverability and search efficiency, but they also introduce new challenges in spatial perception, dynamic control, and safety assurance. In this paper, we propose AION for vision-based aerial ObjectNav without relying on external localization or global maps. AION is an end-to-end dual-policy reinforcement learning (RL) framework that decouples exploration and goal-reaching behaviors into two specialized policies. We evaluate AION on the AI2-THOR benchmark and further assess its real-time performance in IsaacSim using high-fidelity drone models. Experimental results show that AION achieves superior performance across comprehensive evaluation metrics in exploration, navigation efficiency, and safety. The video can be found at this https URL.
https://arxiv.org/abs/2601.15614
The use of artificial intelligence (AI) for drone control can have a transformative impact on drone capabilities, especially when real-world information can be integrated with drone sensing, command, and control, part of the growing field of physical AI. Large language models (LLMs) can be advantageous when trained at scale on general knowledge, in particular when the training data includes information such as the detailed map geography and topology of the entire planet, along with the ability to access real-time situational data such as weather. However, challenges remain in the interface between drones and LLMs in general, with each application requiring a tedious, labor-intensive effort to connect the LLM's trained knowledge to drone command and control. Here, we solve that problem using an interface strategy that is LLM-agnostic and drone-agnostic, providing the first universal, versatile, comprehensive, and easy-to-use drone control interface. We do this using the new Model Context Protocol (MCP) standard, an open standard that provides a universal way for AI systems to access external data, tools, and services. We develop and deploy a cloud-based Linux machine hosting an MCP server that supports the MAVLink protocol, a ubiquitous drone control language used almost universally by millions of drones, including ArduPilot and PX4 vehicles, and demonstrate flight control of a real unmanned aerial vehicle. In further testing, we demonstrate extensive flight planning and control capability in a simulated drone, integrated with a Google Maps MCP server for up-to-date, real-time navigation information. This demonstrates a universal approach to the integration of LLMs with drone command and control, a paradigm that leverages virtually all of the modern AI industry together with drone technology in an easy-to-use interface that translates natural language to drone control.
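The MCP pattern described above boils down to registering drone commands as named tools an LLM can discover and call. The sketch below is a hypothetical, dependency-free illustration of that idea: the tool names (`goto_waypoint`), the `register_tool`/`dispatch` helpers, and the MAVLink-like message layout are illustrative assumptions, not the paper's actual server.

```python
# Hypothetical sketch of exposing a MAVLink-style command as an MCP-like tool.
# In a real server the message would be sent over a MAVLink library; here we
# only build a MISSION_ITEM-like payload to show the shape of the interface.

TOOLS = {}

def register_tool(name, description):
    """Register a function as a tool an LLM could discover and call."""
    def wrap(fn):
        TOOLS[name] = {"fn": fn, "description": description}
        return fn
    return wrap

@register_tool("goto_waypoint", "Fly the drone to a GPS waypoint at a given altitude.")
def goto_waypoint(lat: float, lon: float, alt_m: float) -> dict:
    return {
        "type": "MISSION_ITEM",
        "frame": "GLOBAL_RELATIVE_ALT",
        "command": "NAV_WAYPOINT",
        "x": lat, "y": lon, "z": alt_m,
    }

def dispatch(tool_name: str, **kwargs) -> dict:
    """What the server does when the LLM requests a tool call."""
    return TOOLS[tool_name]["fn"](**kwargs)

msg = dispatch("goto_waypoint", lat=48.858, lon=2.294, alt_m=30.0)
print(msg["command"])  # NAV_WAYPOINT
```

Because the registry is just data, the same dispatch loop works for any drone stack that can consume the emitted messages, which is the "drone-agnostic" part of the claim.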
https://arxiv.org/abs/2601.15486
Autonomous drone racing represents a major frontier in robotics research. It requires an Artificial Intelligence (AI) that can run on board lightweight flying robots under tight resource and time constraints, while pushing the physical system to its limits. The state of the art in this area is a system with a stereo camera and an inertial measurement unit (IMU) that beat human drone racing champions in a controlled indoor environment. Here, we present MonoRace: an onboard drone racing approach using a monocular rolling-shutter camera and an IMU that generalizes to a competition environment without any external motion-tracking system. The approach features robust state estimation that combines neural-network-based gate segmentation with a drone model. Moreover, it includes an offline optimization procedure that leverages the known geometry of the gates to refine any state estimation parameter. This offline optimization is based purely on onboard flight data and is important for fine-tuning the vital external camera calibration parameters. Furthermore, guidance and control are performed by a neural network that foregoes inner-loop controllers by directly sending motor commands. This small network runs on the flight controller at 500 Hz. The proposed approach won the 2025 Abu Dhabi Autonomous Drone Racing Competition (A2RL), outperforming all competing AI teams and three human world-champion pilots in a direct knockout tournament. It set a new milestone in autonomous drone racing research, reaching speeds up to 100 km/h on the competition track and successfully coping with problems such as camera interference and IMU saturation.
https://arxiv.org/abs/2601.15222
On- and off-ramps are understudied road sections, even though they introduce a higher level of variation in highway interactions. Predicting vehicles' behavior in these areas can decrease the impact of uncertainty and increase road safety. In this paper, the difference between this Area of Interest (AoI) and a straight highway section is studied. A multi-layered LSTM architecture is utilized to train the AoI model on the ExiD drone dataset. In the process, different prediction horizons and different model workflows are tested. The results show great promise on horizons up to 4 seconds, with prediction accuracy at the maximum horizon starting from about 76% for the AoI and 94% for the general highway scenarios.
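To make the recurrence behind such LSTM-based trajectory predictors concrete, here is a single LSTM cell forward step in pure Python for a scalar input and hidden size of one. The weights are illustrative placeholders, not the trained AoI model, and a real predictor would stack several such layers with vector states.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h, c, W):
    """One LSTM time step (scalar input/state, hidden size 1).
    W holds weights and biases for the four gates."""
    i = sigmoid(W["wi"] * x + W["ui"] * h + W["bi"])    # input gate
    f = sigmoid(W["wf"] * x + W["uf"] * h + W["bf"])    # forget gate
    o = sigmoid(W["wo"] * x + W["uo"] * h + W["bo"])    # output gate
    g = math.tanh(W["wg"] * x + W["ug"] * h + W["bg"])  # candidate cell state
    c_new = f * c + i * g
    h_new = o * math.tanh(c_new)
    return h_new, c_new

# Placeholder weights; feed a short sequence, e.g. lateral offsets of a vehicle.
W = {k: 0.5 for k in ("wi","ui","bi","wf","uf","bf","wo","uo","bo","wg","ug","bg")}
h, c = 0.0, 0.0
for x in [0.1, 0.2, 0.3, 0.4]:
    h, c = lstm_step(x, h, c, W)
print(-1.0 < h < 1.0)  # hidden state stays in the tanh range
```

Prediction over a multi-second horizon then amounts to rolling such steps forward and decoding the final hidden state into a future position.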
https://arxiv.org/abs/2601.14848
Reliable drone detection is challenging due to limited annotated real-world data, large appearance variability, and the presence of visually similar distractors such as birds. To address these challenges, this paper introduces SimD3, a large-scale high-fidelity synthetic dataset designed for robust drone detection in complex aerial environments. Unlike existing synthetic drone datasets, SimD3 explicitly models drones with heterogeneous payloads, incorporates multiple bird species as realistic distractors, and leverages diverse Unreal Engine 5 environments with controlled weather, lighting, and flight trajectories captured using a 360 six-camera rig. Using SimD3, we conduct an extensive experimental evaluation within the YOLOv5 detection framework, including an attention-enhanced variant termed Yolov5m+C3b, where standard bottleneck-based C3 blocks are replaced with C3b modules. Models are evaluated on synthetic data, combined synthetic and real data, and multiple unseen real-world benchmarks to assess robustness and generalization. Experimental results show that SimD3 provides effective supervision for small-object drone detection and that Yolov5m+C3b consistently outperforms the baseline across in-domain and cross-dataset evaluations. These findings highlight the utility of SimD3 for training and benchmarking robust drone detection models under diverse and challenging conditions.
https://arxiv.org/abs/2601.14742
The integration of agentic AI, powered by large language models (LLMs) with autonomous reasoning, planning, and execution, into unmanned aerial vehicle (UAV) swarms opens new operational possibilities and brings the vision of the Internet of Drones closer to reality. However, infrastructure constraints, dynamic environments, and the computational demands of multi-agent coordination limit real-world deployment in high-risk scenarios such as wildfires and disaster response. This paper investigates the integration of LLM-based agentic AI and edge computing to realize scalable and resilient autonomy in UAV swarms. We first discuss three architectures for supporting UAV swarms - standalone, edge-enabled, and edge-cloud hybrid deployment - each optimized for varying autonomy and connectivity levels. Then, a use case for wildfire search and rescue (SAR) is designed to demonstrate the efficiency of the edge-enabled architecture, enabling high SAR coverage, reduced mission completion times, and a higher level of autonomy compared to traditional approaches. Finally, we highlight open challenges in integrating LLMs and edge computing for mission-critical UAV-swarm applications.
https://arxiv.org/abs/2601.14437
Neural image codecs achieve higher compression ratios than traditional hand-crafted methods such as PNG or JPEG-XL, but often incur substantial computational overhead, limiting their deployment on energy-constrained devices such as smartphones, cameras, and drones. We propose Grayscale Image Compression with Differentiable Logic Circuits (GIC-DLC), a hardware-aware codec where we train lookup tables to combine the flexibility of neural networks with the efficiency of Boolean operations. Experiments on grayscale benchmark datasets show that GIC-DLC outperforms traditional codecs in compression efficiency while allowing substantial reductions in energy consumption and latency. These results demonstrate that learned compression can be hardware-friendly, offering a promising direction for low-power image compression on edge devices.
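The lookup-table idea at the heart of differentiable logic circuits can be shown in a few lines: once trained, each learned gate collapses to a small truth table evaluated with pure Boolean operations. The tables and the half-adder circuit below are illustrative, not GIC-DLC's trained codec.

```python
# A 2-input learned gate, after training, is just a truth table.
def make_lut(truth_table):
    """truth_table[(a, b)] -> output bit, for a, b in {0, 1}."""
    return lambda a, b: truth_table[(a, b)]

XOR = make_lut({(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0})
AND = make_lut({(0, 0): 0, (0, 1): 0, (1, 0): 0, (1, 1): 1})

def half_adder(a, b):
    # Two LUT "gates" composed into a tiny circuit.
    return XOR(a, b), AND(a, b)  # (sum, carry)

print(half_adder(1, 1))  # (0, 1)
```

Inference then needs no multiplications at all, which is what makes such codecs attractive for the energy budgets of cameras and drones.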
https://arxiv.org/abs/2601.14130
Cross-view spatial reasoning is essential for embodied AI, underpinning spatial understanding, mental simulation and planning in complex environments. Existing benchmarks primarily emphasize indoor or street settings, overlooking the unique challenges of open-ended urban spaces characterized by rich semantics, complex geometries, and view variations. To address this, we introduce CityCube, a systematic benchmark designed to probe cross-view reasoning capabilities of current VLMs in urban settings. CityCube integrates four viewpoint dynamics to mimic camera movements and spans a wide spectrum of perspectives from multiple platforms, e.g., vehicles, drones and satellites. For a comprehensive assessment, it features 5,022 meticulously annotated multi-view QA pairs categorized into five cognitive dimensions and three spatial relation expressions. A comprehensive evaluation of 33 VLMs reveals a significant performance disparity with humans: even large-scale models struggle to exceed 54.1% accuracy, remaining 34.2% below human performance. By contrast, small-scale fine-tuned VLMs achieve over 60.0% accuracy, highlighting the necessity of our benchmark. Further analyses indicate the task correlations and fundamental cognitive disparity between VLMs and human-like reasoning.
https://arxiv.org/abs/2601.14339
As aerial platforms evolve from passive observers to active manipulators, the challenge shifts toward designing intuitive interfaces that allow non-expert users to command these systems naturally. This work introduces a novel concept for an autonomous aerial manipulation system capable of interpreting high-level natural language commands to retrieve objects and deliver them to a human user. The system integrates MediaPipe, Grounding DINO, and a Vision-Language-Action (VLA) model with a custom-built drone equipped with a 1-DOF gripper and an Intel RealSense RGB-D camera. The VLA model performs semantic reasoning to interpret the intent of a user prompt and generates a prioritized task queue for grasping relevant objects in the scene. Grounding DINO and a dynamic A* planning algorithm are used to navigate and safely relocate the object. To ensure safe and natural interaction during the handover phase, the system employs a human-centric controller driven by MediaPipe. This module provides real-time human pose estimation, allowing the drone to employ visual servoing to maintain a stable, distinct position directly in front of the user, facilitating a comfortable handover. We demonstrate the system's efficacy through real-world localization and navigation experiments, which yielded maximum, mean Euclidean, and root-mean-square errors of 0.164 m, 0.070 m, and 0.084 m, respectively, highlighting the feasibility of VLA models for aerial manipulation operations.
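For the planning component, a minimal A* search over a 4-connected occupancy grid captures the core mechanism; the paper's dynamic A* variant and its map representation are not specified here, so this is an illustrative baseline with a Manhattan-distance heuristic.

```python
import heapq

def astar(grid, start, goal):
    """grid[r][c] == 1 means blocked. Returns a path of cells or None."""
    def h(p):  # Manhattan distance heuristic (admissible on a 4-connected grid)
        return abs(p[0] - goal[0]) + abs(p[1] - goal[1])
    open_set = [(h(start), 0, start, [start])]
    best_g = {start: 0}
    while open_set:
        _, g, cur, path = heapq.heappop(open_set)
        if cur == goal:
            return path
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nxt = (cur[0] + dr, cur[1] + dc)
            if not (0 <= nxt[0] < len(grid) and 0 <= nxt[1] < len(grid[0])):
                continue
            if grid[nxt[0]][nxt[1]]:
                continue  # occupied cell
            ng = g + 1
            if ng < best_g.get(nxt, float("inf")):
                best_g[nxt] = ng
                heapq.heappush(open_set, (ng + h(nxt), ng, nxt, path + [nxt]))
    return None

grid = [[0, 0, 0],
        [1, 1, 0],   # a wall forces a detour
        [0, 0, 0]]
path = astar(grid, (0, 0), (2, 0))
print(len(path))  # 7 cells: the shortest detour around the wall
```

A "dynamic" variant would rerun or repair this search as the depth camera updates the occupancy grid in flight.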
https://arxiv.org/abs/2601.13809
Drones operating in human-occupied spaces suffer from insufficient communication mechanisms that create uncertainty about their intentions. We present HoverAI, an embodied aerial agent that integrates drone mobility, infrastructure-independent visual projection, and real-time conversational AI into a unified platform. Equipped with a MEMS laser projector, onboard semi-rigid screen, and RGB camera, HoverAI perceives users through vision and voice, responding via lip-synced avatars that adapt appearance to user demographics. The system employs a multimodal pipeline combining VAD, ASR (Whisper), LLM-based intent classification, RAG for dialogue, face analysis for personalization, and voice synthesis (XTTS v2). Evaluation demonstrates high accuracy in command recognition (F1: 0.90), demographic estimation (gender F1: 0.89, age MAE: 5.14 years), and speech transcription (WER: 0.181). By uniting aerial robotics with adaptive conversational AI and self-contained visual output, HoverAI introduces a new class of spatially-aware, socially responsive embodied agents for applications in guidance, assistance, and human-centered interaction.
https://arxiv.org/abs/2601.13801
Glass surfaces, ubiquitous in both daily life and professional environments, present a potential threat to vision-based systems such as robot and drone navigation. To address this challenge, most recent studies have shown significant interest in Video Glass Surface Detection (VGSD). We observe that objects in the reflection (or transmission) layer appear farther away than the glass surface itself. Consequently, in video motion scenarios, notable reflected (or transmitted) objects on the glass surface move slower than objects in non-glass regions within the same spatial plane, and this motion inconsistency can effectively reveal the presence of glass surfaces. Based on this observation, we propose a novel network, named MVGD-Net, for detecting glass surfaces in videos by leveraging motion-inconsistency cues. Our MVGD-Net features three novel modules: the Cross-scale Multimodal Fusion Module (CMFM), which integrates extracted spatial features and estimated optical flow maps, and the History Guided Attention Module (HGAM) and Temporal Cross Attention Module (TCAM), both of which further enhance temporal features. A Temporal-Spatial Decoder (TSD) is also introduced to fuse the spatial and temporal features for generating the glass region mask. Furthermore, for training our network, we also propose a large-scale dataset comprising 312 diverse glass scenarios with a total of 19,268 frames. Extensive experiments demonstrate that our MVGD-Net outperforms relevant state-of-the-art methods.
https://arxiv.org/abs/2601.13715
Micro-Unmanned Aerial Vehicles (UAVs) are rapidly expanding into tasks from inventory to environmental sensing, yet their short endurance and unreliable navigation in GPS-denied spaces limit deployment. Lighter-Than-Air (LTA) drones offer an energy-efficient alternative: they use a helium envelope to provide buoyancy, which enables near-zero power drain during hovering and much longer operation. LTAs are promising, but their design is complex, and they lack integrated solutions that enable sustained autonomous operation and navigation with simple, low-infrastructure setups. We propose a compact, self-sustaining LTA drone that uses light for both energy harvesting and navigation. Our contributions are threefold: (i) a high-fidelity simulation framework to analyze LTA aerodynamics and select a stable, efficient configuration; (ii) a framework to integrate solar cells on the envelope to provide net-positive energy; and (iii) a point-and-go navigation system with three light-seeking algorithms operating on a single light beacon. Our LTA analysis, together with the integrated solar panels, not only saves energy while flying but also enables sustainable operation: providing 1 minute of flying time for every 4 minutes of energy harvesting under illumination of 80 klux. We also demonstrate robust single-beacon navigation towards a light source up to 7 m away, in indoor and outdoor environments, even with moderate winds. The resulting system indicates a plausible path toward persistent, autonomous operation for indoor and outdoor monitoring. More broadly, this work provides a practical pathway for translating the promise of LTA drones into a persistent, self-sustaining aerial system.
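A single-beacon light-seeking controller can be sketched with three simulated light sensors and a turn-toward-the-brightest rule. The inverse-square intensity model, sensor geometry, and gains below are illustrative assumptions, not the paper's three algorithms.

```python
import math

def intensity(sensor_pos, beacon):
    """Inverse-square light falloff from a point beacon."""
    d2 = (sensor_pos[0] - beacon[0])**2 + (sensor_pos[1] - beacon[1])**2
    return 1.0 / max(d2, 1e-9)

def step(pos, heading, beacon, speed=0.2, turn=0.3):
    """Read left/center/right sensors, turn toward the brightest, move forward."""
    readings = []
    for offset in (turn, 0.0, -turn):
        a = heading + offset
        sensor = (pos[0] + math.cos(a), pos[1] + math.sin(a))
        readings.append(intensity(sensor, beacon))
    best = readings.index(max(readings))
    heading += (turn, 0.0, -turn)[best]
    pos = (pos[0] + speed * math.cos(heading),
           pos[1] + speed * math.sin(heading))
    return pos, heading

pos, heading, beacon = (0.0, 0.0), 0.0, (5.0, 3.0)
for _ in range(60):
    pos, heading = step(pos, heading, beacon)
dist = math.hypot(pos[0] - beacon[0], pos[1] - beacon[1])
print(round(dist, 2))
```

The greedy bang-bang rule homes in on the beacon and then loiters near it, which is the behavior a point-and-go system needs from a single light source.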
https://arxiv.org/abs/2601.13088
Recent advances in large Vision-Language Models (VLMs) have provided rich semantic understanding that empowers drones to search for open-set objects via natural language instructions. However, prior systems struggle to integrate VLMs into practical aerial systems due to orders-of-magnitude frequency mismatch between VLM inference and real-time planning, as well as VLMs' limited 3D scene understanding. They also lack a unified mechanism to balance semantic guidance with motion efficiency in large-scale environments. To address these challenges, we present AirHunt, an aerial object navigation system that efficiently locates open-set objects with zero-shot generalization in outdoor environments by seamlessly fusing VLM semantic reasoning with continuous path planning. AirHunt features a dual-pathway asynchronous architecture that establishes a synergistic interface between VLM reasoning and path planning, enabling continuous flight with adaptive semantic guidance that evolves through motion. Moreover, we propose an active dual-task reasoning module that exploits geometric and semantic redundancy to enable selective VLM querying, and a semantic-geometric coherent planning module that dynamically reconciles semantic priorities and motion efficiency in a unified framework, enabling seamless adaptation to environmental heterogeneity. We evaluate AirHunt across diverse object navigation tasks and environments, demonstrating a higher success rate with lower navigation error and reduced flight time compared to state-of-the-art methods. Real-world experiments further validate AirHunt's practical capability in complex and challenging environments. Code and dataset will be made publicly available before publication.
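The dual-pathway idea, a fast planning loop that always consumes the most recent semantic guidance while a slow VLM pathway refreshes it only occasionally, can be sketched deterministically. The tick rates and the guidance payload below are illustrative assumptions about the architecture, not AirHunt's implementation.

```python
def slow_vlm_pathway(tick):
    """Stand-in for an expensive VLM query; returns a goal-direction hint."""
    return {"tick": tick, "hint": "north" if tick < 50 else "east"}

def run(ticks=100, vlm_period=25):
    guidance = slow_vlm_pathway(0)       # initial query before takeoff
    plan_log, vlm_calls = [], 1
    for t in range(1, ticks + 1):
        if t % vlm_period == 0:          # slow pathway fires occasionally
            guidance = slow_vlm_pathway(t)
            vlm_calls += 1
        # Fast pathway: plan every tick with whatever guidance is latest.
        plan_log.append((t, guidance["hint"]))
    return plan_log, vlm_calls

log, calls = run()
print(calls)       # 5: one initial query plus four periodic updates
print(log[-1][1])  # 'east' once the hint changed at tick 50
```

The planner never blocks on the slow pathway, which is how the orders-of-magnitude frequency mismatch between VLM inference and real-time planning is absorbed.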
https://arxiv.org/abs/2601.12742
Detecting faces in overhead images remains a significant challenge due to extreme scale variations and environmental clutter. To address this, we created the BirdsEye-RU dataset, a comprehensive collection of 2,978 images containing over eight thousand annotated faces. This dataset is specifically designed to capture small and distant faces across diverse environments, containing both drone images and smartphone-captured images from high altitude. We present a detailed description of the BirdsEye-RU dataset in this paper. We made our dataset freely available to the public, and it can be accessed at this https URL.
https://arxiv.org/abs/2601.12533
Counting and tracking dense crowds in large-scale scenes is highly challenging, yet existing methods mainly rely on datasets captured by fixed cameras, which provide limited spatial coverage and are inadequate for large-scale dense crowd analysis. To address this limitation, we propose a flexible solution using moving drones to capture videos and perform video-level crowd counting and tracking of unique pedestrians across entire scenes. We introduce MovingDroneCrowd++, the largest video-level dataset for dense crowd counting and tracking captured by moving drones, covering diverse and complex conditions with varying flight altitudes, camera angles, and illumination. Existing methods fail to achieve satisfactory performance on this dataset. To this end, we propose GD3A (Global Density Map Decomposition via Descriptor Association), a density map-based video individual counting method that avoids explicit localization. GD3A establishes pixel-level correspondences between pedestrian descriptors across consecutive frames via optimal transport with an adaptive dustbin score, enabling the decomposition of global density maps into shared, inflow, and outflow components. Building on this framework, we further introduce DVTrack, which converts descriptor-level matching into instance-level associations through a descriptor voting mechanism for pedestrian tracking. Experimental results show that our methods significantly outperform existing approaches under dense crowds and complex motion, reducing counting error by 47.4 percent and improving tracking performance by 39.2 percent.
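The optimal-transport core of such descriptor association can be shown with a minimal Sinkhorn normalization that turns pairwise similarity scores into a soft assignment between pedestrians in consecutive frames. The scores and iteration count are toy values, and the paper's adaptive dustbin score for unmatched (inflow/outflow) pedestrians is omitted here.

```python
import math

def sinkhorn(scores, iters=50):
    """Alternately row- and column-normalize exp(scores) toward a
    doubly stochastic soft-assignment matrix."""
    P = [[math.exp(s) for s in row] for row in scores]
    for _ in range(iters):
        for row in P:                          # normalize rows
            z = sum(row)
            for j in range(len(row)):
                row[j] /= z
        for j in range(len(P[0])):             # normalize columns
            z = sum(P[i][j] for i in range(len(P)))
            for i in range(len(P)):
                P[i][j] /= z
    return P

# Descriptor similarities: pedestrian 0 in frame t resembles pedestrian 1 in t+1.
scores = [[0.1, 2.0],
          [2.0, 0.1]]
P = sinkhorn(scores)
print(P[0][1] > P[0][0])  # True: mass concentrates on the matching pair
```

With a dustbin row and column added, mass that stays unassigned corresponds to pedestrians entering or leaving the view, which is what lets a global density map be decomposed into shared, inflow, and outflow parts.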
https://arxiv.org/abs/2601.12500
The rapid proliferation of airborne platforms, including commercial aircraft, drones, and UAVs, has intensified the need for real-time, automated threat assessment systems. Current approaches depend heavily on manual monitoring, resulting in limited scalability and operational inefficiencies. This work introduces a dual-task model based on EfficientNetB4 capable of performing airborne object classification and threat-level prediction simultaneously. To address the scarcity of clean, balanced training data, we constructed the AODTA Dataset by aggregating and refining multiple public sources. We benchmarked our approach on both the AVD Dataset and the newly developed AODTA Dataset and further compared performance against a ResNet-50 baseline, which consistently underperformed EfficientNetB4. Our EfficientNetB4 model achieved 96% accuracy in object classification and 90% accuracy in threat-level prediction, underscoring its promise for applications in surveillance, defense, and airspace management. Although the title references detection, this study focuses specifically on classification and threat-level inference using pre-localized airborne object images provided by existing datasets.
https://arxiv.org/abs/2601.11907
We consider the problem of adaptively monitoring a wildfire front using a mobile agent (e.g., a drone), whose trajectory determines where sensor data is collected and thus influences the accuracy of fire propagation estimation. This is a challenging problem, as the stochastic nature of wildfire evolution requires the seamless integration of sensing, estimation, and control, often treated separately in existing methods. State-of-the-art methods either impose linear-Gaussian assumptions to establish optimality or rely on approximations and heuristics, often without providing explicit performance guarantees. To address these limitations, we formulate the fire front monitoring task as a stochastic optimal control problem that integrates sensing, estimation, and control. We derive an optimal recursive Bayesian estimator for a class of stochastic nonlinear elliptical-growth fire front models. Subsequently, we transform the resulting nonlinear stochastic control problem into a finite-horizon Markov decision process and design an information-seeking predictive control law obtained via a lower confidence bound-based adaptive search algorithm with asymptotic convergence to the optimal policy.
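A recursive Bayesian estimator of this kind can be sketched as a discrete Bayes filter over a fire-front "radius" state: predict with a stochastic growth model, then update with a noisy range measurement from the drone. The growth and measurement probabilities below are illustrative, not the paper's elliptical-growth model.

```python
def predict(belief, p_grow=0.7):
    """Prediction step: the front advances one cell w.p. p_grow, else stays."""
    new = [0.0] * len(belief)
    for r, p in enumerate(belief):
        new[r] += (1.0 - p_grow) * p
        if r + 1 < len(belief):
            new[r + 1] += p_grow * p
        else:
            new[r] += p_grow * p           # saturate at the grid edge
    return new

def update(belief, z, p_hit=0.8):
    """Measurement step: observed radius z is correct w.p. p_hit."""
    n = len(belief)
    post = [p * (p_hit if r == z else (1.0 - p_hit) / (n - 1))
            for r, p in enumerate(belief)]
    s = sum(post)
    return [p / s for p in post]

belief = [1.0, 0.0, 0.0, 0.0, 0.0]         # fire starts at radius 0
for z in (1, 2, 2):                         # drone measurements over 3 steps
    belief = predict(belief)
    belief = update(belief, z)
print(belief.index(max(belief)))           # 2: MAP estimate of the radius
```

The control problem then reduces to choosing drone trajectories whose measurements keep this posterior sharp, which is what the information-seeking predictive control law optimizes.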
https://arxiv.org/abs/2601.11231
Marker-based landing is widely used in drone delivery and return-to-base systems for its simplicity and reliability. However, most approaches assume idealized landing site visibility and sensor performance, limiting robustness in complex urban settings. We present a simulation-based evaluation suite on the AirSim platform with systematically varied urban layouts, lighting, and weather to replicate realistic operational diversity. Using onboard camera sensors (RGB for marker detection and depth for obstacle avoidance), we benchmark two heuristic coverage patterns and a reinforcement learning-based agent, analyzing how exploration strategy and scene complexity affect success rate, path efficiency, and robustness. Results underscore the need to evaluate marker-based autonomous landing under diverse, sensor-relevant conditions to guide the development of reliable aerial navigation systems.
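One of the heuristic coverage patterns such a benchmark typically includes is a boustrophedon ("lawnmower") sweep; a minimal waypoint generator looks as follows, with grid size and sweep spacing as illustrative parameters.

```python
def lawnmower(width, height, spacing=1):
    """Yield grid waypoints sweeping left-right, stepping up by `spacing`."""
    waypoints = []
    y, left_to_right = 0, True
    while y < height:
        xs = range(width) if left_to_right else range(width - 1, -1, -1)
        waypoints.extend((x, y) for x in xs)
        left_to_right = not left_to_right  # reverse direction each sweep
        y += spacing
    return waypoints

wps = lawnmower(4, 3)
print(wps[:4])   # [(0, 0), (1, 0), (2, 0), (3, 0)]
print(len(wps))  # 12: every cell of the 4x3 area is visited once
```

A search agent then flies these waypoints until the RGB camera detects the marker, at which point it switches to the descent controller; the RL agent replaces the fixed sweep with a learned exploration policy.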
https://arxiv.org/abs/2601.11078
Real-time image segmentation is a key enabler for AR/VR, robotics, drones, and autonomous systems, where tight accuracy, latency, and energy budgets must be met on resource-constrained edge devices. While U-Net offers a favorable balance of accuracy and efficiency compared to large transformer-based models, achieving real-time performance on high-resolution input remains challenging due to compute, memory, and power limits. Extreme quantization, particularly binary networks, is appealing for its hardware-friendly operations. However, two obstacles limit practicality: (1) severe accuracy degradation, and (2) a lack of end-to-end implementations that deliver efficiency on general-purpose GPUs. We make two empirical observations that guide our design: (1) an explicit zero state is essential, as training binary U-Net weights with zero masking yields noticeable sparsity; and (2) quantization sensitivity is uniform across layers. Motivated by these findings, we introduce Masked Binary U-Net (MBU-Net), obtained through a cost-aware masking strategy that prioritizes masking where it yields the highest accuracy-per-cost, reconciling accuracy with near-binary efficiency. To realize these gains in practice, we develop a GPU execution framework that maps MBU-Net to Tensor Cores via a subtractive bit-encoding scheme, efficiently implementing masked binary weights with binary activations. This design leverages native binary Tensor Core BMMA instructions, enabling high throughput and energy savings on widely available GPUs. Across three segmentation benchmarks, MBU-Net attains near full-precision accuracy (a 3% average drop) while delivering a 2.04x speedup and a 3.54x energy reduction over a 16-bit floating-point U-Net.
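The subtractive bit-encoding can be checked in a few lines: a masked-binary (ternary) weight matrix W over {-1, 0, +1} splits as W = P - N with P and N binary, so one ternary matmul becomes two binary matmuls that binary Tensor Core instructions could execute. The matrices below are toy data; this pure-Python check only verifies the algebraic identity, not the CUDA mapping.

```python
def split(W):
    """Split ternary weights into positive and negative binary masks."""
    P = [[1 if w == 1 else 0 for w in row] for row in W]
    N = [[1 if w == -1 else 0 for w in row] for row in W]
    return P, N

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

W = [[1, 0, -1],
     [0, -1, 1]]          # masked binary weights (0 = pruned by the mask)
X = [[1], [1], [0]]       # binary activations
P, N = split(W)
direct = matmul(W, X)
encoded = [[p[0] - n[0]] for p, n in zip(matmul(P, X), matmul(N, X))]
print(direct == encoded)  # True: two binary matmuls reproduce the ternary one
```

Since P and N contain only 0/1 entries, both products are pure binary matmuls, and the single subtraction at the end is the only non-binary operation left.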
https://arxiv.org/abs/2601.11660