Paper Reading AI Learner

Learning Embeddings with Centroid Triplet Loss for Object Identification in Robotic Grasping

2024-04-09 13:01:26
Anas Gouda, Max Schwarz, Christopher Reining, Sven Behnke, Alice Kirchheim

Abstract

Foundation models are a strong trend in deep learning and computer vision. These models serve as a base for applications, requiring little or no further fine-tuning by developers before integration. Foundation models for zero-shot object segmentation, such as Segment Anything (SAM), output segmentation masks from images without any further object information. When followed in a pipeline by an object identification model, they can perform object detection without training. Here, we focus on training such an object identification model. A crucial practical requirement for an object identification model is flexibility in input size. As object identification is an image retrieval problem, a suitable method should handle multi-query multi-gallery situations without constraining the number of input images (e.g., by fixed-size aggregation layers). The key to training such a model is the centroid triplet loss (CTL), which aggregates image features into their centroids. CTL yields high accuracy, avoids misleading training signals, and keeps the model's input size flexible. In our experiments, we establish a new state of the art on the ArmBench object identification task, demonstrating the general applicability of our model. We furthermore demonstrate an integrated unseen object detection pipeline on the challenging HOPE dataset, which requires fine-grained detection. There, our pipeline matches and surpasses related methods that have been trained on dataset-specific data.
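To illustrate the idea described in the abstract, here is a minimal numpy sketch of a centroid-based triplet objective: query features are compared against the mean (centroid) of same-object and different-object gallery features rather than against individual images, so any number of gallery images can be aggregated. The Euclidean distance, the fixed margin, and all names below are illustrative assumptions; the paper's exact formulation may differ.

```python
import numpy as np

def centroid_triplet_loss(anchor, positives, negatives, margin=0.5):
    # Centroids aggregate an arbitrary number of gallery images per object,
    # so the input size stays flexible (no fixed-size aggregation layer).
    pos_centroid = positives.mean(axis=0)
    neg_centroid = negatives.mean(axis=0)
    d_pos = np.linalg.norm(anchor - pos_centroid)
    d_neg = np.linalg.norm(anchor - neg_centroid)
    # Standard triplet hinge, but measured against centroids
    # instead of single positive/negative images.
    return max(0.0, d_pos - d_neg + margin)

# Toy 2-D embeddings: the anchor sits on the positive object's centroid.
anchor = np.array([1.0, 0.0])
positives = np.array([[0.9, 0.1], [1.1, -0.1]])   # same-object crops
negatives = np.array([[-1.0, 0.0], [-1.0, 0.2]])  # different-object crops
loss = centroid_triplet_loss(anchor, positives, negatives)
```

Because the centroid is a simple mean, the number of query and gallery images per object can vary freely between batches, which is the input-size flexibility the abstract emphasizes.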


URL

https://arxiv.org/abs/2404.06277

PDF

https://arxiv.org/pdf/2404.06277.pdf
