Abstract
Geospatial Copilots unlock unprecedented potential for performing Earth Observation (EO) applications through natural language instructions. However, existing agents rely on overly simplified single tasks and template-based prompts, creating a disconnect with real-world scenarios. In this work, we present GeoLLM-Engine, an environment for tool-augmented agents with intricate tasks routinely executed by analysts on remote sensing platforms. We enrich our environment with geospatial API tools, dynamic maps/UIs, and external multimodal knowledge bases to properly gauge an agent's proficiency in interpreting realistic high-level natural language commands and its functional correctness in task completion. By alleviating overheads typically associated with human-in-the-loop benchmark curation, we harness our massively parallel engine across 100 GPT-4-Turbo nodes, scaling to over half a million diverse multi-tool tasks spanning 1.1 million satellite images. By moving beyond traditional single-task image-caption paradigms, we investigate state-of-the-art agents and prompting techniques against long-horizon prompts.
URL
https://arxiv.org/abs/2404.15500