Abstract
Unmanned aerial vehicles (UAVs) are crucial tools for post-disaster search and rescue, but they face challenges such as high information density, rapid viewpoint changes, and dynamic structures, especially in long-horizon navigation. However, current UAV vision-and-language navigation (VLN) methods struggle to model long-horizon spatiotemporal context in complex environments, resulting in inaccurate semantic alignment and unstable path planning. To this end, we propose LongFly, a spatiotemporal context modeling framework for long-horizon UAV VLN. LongFly adopts a history-aware spatiotemporal modeling strategy that transforms fragmented, redundant historical data into structured, compact, and expressive representations. First, we propose a slot-based historical image compression module that dynamically distills multi-view historical observations into fixed-length contextual representations. Then, a spatiotemporal trajectory encoding module is introduced to capture the temporal dynamics and spatial structure of UAV trajectories. Finally, to integrate the accumulated spatiotemporal context with current observations, we design a prompt-guided multimodal integration module that supports time-based reasoning and robust waypoint prediction. Experimental results demonstrate that LongFly outperforms state-of-the-art UAV VLN baselines by 7.89% in success rate and 6.33% in success weighted by path length, consistently across both seen and unseen environments.
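The core appeal of the slot-based compression module is that the context size stays constant no matter how long the flight history grows. A minimal sketch of that idea, assuming a slot-attention-style cross-attention mechanism (the function names, slot count, and feature dimensions below are illustrative, not taken from the paper):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def compress_history(features, slots, num_iters=3):
    """Distill a variable-length history of patch features (N, D) into a
    fixed-length set of slots (K, D) via iterative cross-attention.

    Hypothetical sketch: softmax over the slot axis makes slots compete
    for input features; each slot is then updated as a weighted mean of
    the features it claimed.
    """
    K, D = slots.shape
    for _ in range(num_iters):
        attn = softmax(slots @ features.T / np.sqrt(D), axis=0)  # (K, N)
        attn = attn / (attn.sum(axis=1, keepdims=True) + 1e-8)   # renormalize per slot
        slots = attn @ features                                  # (K, D) update
    return slots

rng = np.random.default_rng(0)
hist_short = rng.normal(size=(12, 16))  # features from a short flight
hist_long = rng.normal(size=(87, 16))   # features from a much longer flight
slots0 = rng.normal(size=(4, 16))       # 4 slots: the fixed context budget

# Both histories compress to the same fixed-length representation.
print(compress_history(hist_short, slots0).shape)  # (4, 16)
print(compress_history(hist_long, slots0).shape)   # (4, 16)
```

The fixed slot count is what keeps downstream attention cost bounded over long horizons: the planner always attends over K context vectors rather than the full, growing observation history.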
URL
https://arxiv.org/abs/2512.22010