Abstract
The safety validation of autonomous robotic vehicles hinges on systematically testing their planning and control stacks against rare, safety-critical scenarios. Mining these long-tail events from massive real-world driving logs is therefore a critical step in the robot development lifecycle. The goal of the scenario mining task is to retrieve the log segments that match a natural-language scenario description, enabling targeted re-simulation, regression testing, and failure analysis of the robot's decision-making algorithms. RefAV, introduced by the Argoverse team, is an end-to-end framework that uses large language models (LLMs) to spatially and temporally localize scenarios described in natural language. However, RefAV retrieves over trajectory labels alone, ignoring the direct connection between natural language and raw RGB images that underpins video retrieval, and it depends on the quality of upstream 3D object detection and tracking: errors in the trajectory data propagate into the downstream spatial and temporal localization. To address these issues, we propose Robust Scenario Mining for Robotic Autonomy from Coarse to Fine (SMc2f), a coarse-to-fine pipeline that (i) employs vision-language models (VLMs) for coarse image-text filtering, (ii) builds a database of successful mining cases on top of RefAV and automatically retrieves exemplars to few-shot condition the LLM for more robust retrieval, and (iii) introduces text-trajectory contrastive learning that pulls matched pairs together and pushes mismatched pairs apart in a shared embedding space, yielding a fine-grained matcher that refines the LLM's candidate trajectories. Experiments on public datasets demonstrate substantial gains in both retrieval quality and efficiency.
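The abstract does not specify the form of the text-trajectory contrastive objective. A minimal sketch, assuming a CLIP-style symmetric InfoNCE loss over a batch where row i of the text embeddings is the positive match for row i of the trajectory embeddings (the function names and the `temperature` value below are illustrative, not from the paper):

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    """Project embeddings onto the unit sphere so dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def text_trajectory_contrastive_loss(text_emb, traj_emb, temperature=0.07):
    """Symmetric InfoNCE loss: row i of text_emb matches row i of traj_emb.

    Matched pairs are pulled together (high diagonal similarity) and
    mismatched pairs pushed apart (low off-diagonal similarity).
    """
    t = l2_normalize(np.asarray(text_emb, dtype=np.float64))
    r = l2_normalize(np.asarray(traj_emb, dtype=np.float64))
    logits = t @ r.T / temperature          # (N, N) pairwise similarity matrix
    labels = np.arange(logits.shape[0])

    def cross_entropy(lg):
        # numerically stable log-softmax over each row
        lg = lg - lg.max(axis=1, keepdims=True)
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    # average the text->trajectory and trajectory->text directions
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))
```

At inference time, a matcher trained this way could re-rank the LLM's candidate trajectories by cosine similarity between the query-text embedding and each candidate's trajectory embedding.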
URL
https://arxiv.org/abs/2601.12010