Paper Reading AI Learner

JIST: Joint Image and Sequence Training for Sequential Visual Place Recognition

2024-03-28 19:11:26
Gabriele Berton, Gabriele Trivigno, Barbara Caputo, Carlo Masone

Abstract

Visual Place Recognition aims at recognizing previously visited places by relying on visual cues, and it is used in robotics applications for SLAM and localization. Since a mobile robot typically has access to a continuous stream of frames, this task is naturally cast as a sequence-to-sequence localization problem. Nevertheless, obtaining sequences of labelled data is much more expensive than collecting isolated images, which can be done in an automated way with little supervision. As a mitigation to this problem, we propose a novel Joint Image and Sequence Training protocol (JIST) that leverages large uncurated sets of images through a multi-task learning framework. With JIST we also introduce SeqGeM, an aggregation layer that revisits the popular GeM pooling to produce a single robust and compact embedding from a sequence of single-frame embeddings. We show that our model outperforms the previous state of the art while being faster, using descriptors that are 8 times smaller, having a lighter architecture, and allowing it to process sequences of various lengths. Code is available at this https URL
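The abstract describes SeqGeM only at a high level: a GeM-style pooling applied across a sequence of single-frame embeddings to obtain one compact sequence descriptor. The PyTorch sketch below illustrates that reading; the class name SeqGeMSketch, the learnable exponent p, the L2 normalization, and the tensor shapes are illustrative assumptions, not the authors' implementation (the repository linked from the paper contains the actual code).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SeqGeMSketch(nn.Module):
    """Hypothetical GeM-over-time pooling: (batch, seq_len, dim) -> (batch, dim)."""

    def __init__(self, p: float = 3.0, eps: float = 1e-6):
        super().__init__()
        self.p = nn.Parameter(torch.tensor(p))  # learnable pooling exponent, as in GeM
        self.eps = eps

    def forward(self, frame_embeddings: torch.Tensor) -> torch.Tensor:
        # Clamp for numerical stability (standard GeM practice), raise to p,
        # average over the temporal axis, then take the p-th root.
        x = frame_embeddings.clamp(min=self.eps).pow(self.p)
        pooled = x.mean(dim=1).pow(1.0 / self.p)
        # Return an L2-normalized sequence descriptor (assumption for retrieval use).
        return F.normalize(pooled, p=2, dim=-1)


if __name__ == "__main__":
    frames = torch.rand(2, 5, 512)       # 2 sequences, 5 frames each, 512-D frame embeddings
    print(SeqGeMSketch()(frames).shape)  # torch.Size([2, 512])
```

Because the sequence axis is reduced with a mean, the same layer accepts sequences of any length, which is consistent with the variable-length property claimed in the abstract.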

URL

https://arxiv.org/abs/2403.19787

PDF

https://arxiv.org/pdf/2403.19787.pdf

