JIST: Joint Image and Sequence Training for Sequential Visual Place Recognition

2024-03-28 19:11:26
Gabriele Berton, Gabriele Trivigno, Barbara Caputo, Carlo Masone


Visual Place Recognition aims at recognizing previously visited places by relying on visual cues, and it is used in robotics applications for SLAM and localization. Since a mobile robot typically has access to a continuous stream of frames, this task is naturally cast as a sequence-to-sequence localization problem. Nevertheless, obtaining sequences of labelled data is much more expensive than collecting isolated images, which can be done in an automated way with little supervision. To mitigate this problem, we propose a novel Joint Image and Sequence Training protocol (JIST) that leverages large uncurated sets of images through a multi-task learning framework. With JIST we also introduce SeqGeM, an aggregation layer that revisits the popular GeM pooling to produce a single robust and compact embedding from a sequence of single-frame embeddings. We show that our model outperforms the previous state of the art while being faster, using descriptors that are 8 times smaller, having a lighter architecture, and supporting sequences of various lengths. Code is available at this https URL
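
To make the idea of pooling a sequence of single-frame embeddings into one descriptor concrete, here is a minimal sketch of standard generalized-mean (GeM) pooling applied along the sequence axis. It is an illustration under stated assumptions, not the authors' SeqGeM layer: the abstract does not give SeqGeM's exact formulation, and the function name `seq_gem`, the fixed exponent `p`, and the clamping constant `eps` are hypothetical choices made for this example. In a trained model the exponent would typically be a learnable parameter inside a deep learning framework.

```python
def seq_gem(frame_embeddings, p=3.0, eps=1e-6):
    """Generalized-mean pooling over a sequence of frame embeddings.

    frame_embeddings: list of equal-length lists of floats, one per frame.
    p: pooling exponent (assumed fixed here). p = 1 is average pooling;
       large p approaches max pooling.
    eps: lower clamp, so that fractional powers are well defined.
         This assumes non-negative features, as in the usual GeM setting.
    Returns one embedding with the same dimensionality as a single frame.
    """
    if not frame_embeddings:
        raise ValueError("need at least one frame embedding")
    dim = len(frame_embeddings[0])
    if any(len(e) != dim for e in frame_embeddings):
        raise ValueError("all frame embeddings must have the same length")

    n_frames = len(frame_embeddings)
    pooled = []
    for d in range(dim):
        # Mean of the p-th powers across frames, then the p-th root.
        mean_pow = sum(max(e[d], eps) ** p for e in frame_embeddings) / n_frames
        pooled.append(mean_pow ** (1.0 / p))
    return pooled


if __name__ == "__main__":
    # Two sequences of different lengths yield descriptors of the same size.
    short_seq = [[0.2, 0.9, 0.1], [0.4, 0.7, 0.3]]
    long_seq = short_seq + [[0.6, 0.5, 0.2], [0.1, 0.8, 0.4], [0.3, 0.6, 0.5]]
    print(len(seq_gem(short_seq)), len(seq_gem(long_seq)))
```

Because the pooling reduces over the sequence axis only, the output size depends on the per-frame embedding dimension and not on the number of frames, which is consistent with the abstract's claim that sequences of various lengths can be processed.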
