Abstract
Visual Place Recognition aims at recognizing previously visited places by relying on visual cues, and it is used in robotics applications for SLAM and localization. Since a mobile robot typically has access to a continuous stream of frames, this task is naturally cast as a sequence-to-sequence localization problem. Nevertheless, obtaining sequences of labelled data is much more expensive than collecting isolated images, which can be done in an automated way with little supervision. To mitigate this problem, we propose a novel Joint Image and Sequence Training protocol (JIST) that leverages large uncurated sets of images through a multi-task learning framework. With JIST we also introduce SeqGeM, an aggregation layer that revisits the popular GeM pooling to produce a single robust and compact embedding from a sequence of single-frame embeddings. We show that our model outperforms the previous state of the art while being faster, using 8 times smaller descriptors, having a lighter architecture, and allowing it to process sequences of various lengths. Code is available at this https URL
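The abstract describes SeqGeM as a revisiting of GeM (generalized-mean) pooling that aggregates a sequence of single-frame embeddings into one compact descriptor. The sketch below is a minimal NumPy illustration of that idea, not the paper's implementation: it applies the standard GeM formula, (1/n · Σ xᵢᵖ)^(1/p), across the temporal axis of a stack of frame embeddings. In the actual model the exponent `p` is a learnable parameter and the pooling operates on network features; here `p` is fixed and the embeddings are random placeholders.

```python
import numpy as np

def gem_pool(x, p=3.0, eps=1e-6):
    """Generalized-mean (GeM) pooling over the first axis.

    x:   (n, d) array of n frame embeddings of dimension d.
    p:   pooling exponent; p=1 reduces to average pooling,
         p -> inf approaches max pooling. Learnable in the paper's setting.
    eps: clamp to keep the power well-defined (GeM assumes
         non-negative activations, e.g. post-ReLU features).
    """
    x = np.clip(x, eps, None)
    return np.mean(x ** p, axis=0) ** (1.0 / p)

# Hypothetical example: pool a sequence of 5 frame embeddings (d=256)
# into a single sequence-level descriptor of the same dimension.
frames = np.random.rand(5, 256)
seq_descriptor = gem_pool(frames, p=3.0)
print(seq_descriptor.shape)  # (256,)
```

Because the pooling is a reduction over the sequence axis, the same layer handles sequences of any length and keeps the output descriptor as small as a single-frame embedding, consistent with the compactness and variable-length claims in the abstract.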
URL: https://arxiv.org/abs/2403.19787