Paper Reading AI Learner

From Occlusion to Insight: Object Search in Semantic Shelves using Large Language Models

2023-02-24 22:01:20
Satvik Sharma, Kaushik Shivakumar, Huang Huang, Ryan Hoque, Alishba Imran, Brian Ichter, Ken Goldberg

Abstract

How can a robot efficiently extract a desired object from a shelf when it is fully occluded by other objects? Prior works propose geometric approaches for this problem but do not consider object semantics. Shelves in pharmacies, restaurant kitchens, and grocery stores are often organized such that semantically similar objects are placed close to one another. Can large language models (LLMs) serve as semantic knowledge sources to accelerate robotic mechanical search in semantically arranged environments? With Semantic Spatial Search on Shelves (S^4), we use LLMs to generate affinity matrices, where entries correspond to the semantic likelihood of physical proximity between objects. We derive semantic spatial distributions by synthesizing semantics with learned geometric constraints. S^4 incorporates Optical Character Recognition (OCR) and semantic refinement with predictions from ViLD, an open-vocabulary object detection model. Simulation experiments suggest that semantic spatial search reduces search time relative to pure spatial search by an average of 24% across three domains: pharmacy, kitchen, and office shelves. A manually collected dataset of 100 semantic scenes suggests that OCR and semantic refinement improve object detection accuracy by 35%. Lastly, physical experiments on a pharmacy shelf suggest a 47.1% improvement over pure spatial search. Supplementary material can be found at this https URL.
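The core idea of fusing an LLM-derived affinity matrix with a geometric occupancy prior can be sketched as follows. This is a minimal illustration, not the paper's implementation: the object names, the hand-set affinity scores (which S^4 would obtain from an LLM), the 1-D shelf grid, and the Gaussian bump fusion are all illustrative assumptions.

```python
import numpy as np

# Hypothetical object classes on a pharmacy shelf (illustrative only).
objects = ["ibuprofen", "aspirin", "stapler"]

# Toy affinity matrix: entry (i, j) scores how likely object i is to be
# placed near object j. In S^4 these scores come from an LLM; here they
# are hand-set for illustration.
affinity = np.array([
    [1.0, 0.9, 0.1],   # ibuprofen vs. (ibuprofen, aspirin, stapler)
    [0.9, 1.0, 0.1],   # aspirin
    [0.1, 0.1, 1.0],   # stapler
])

def semantic_spatial_distribution(spatial_prior, visible_positions,
                                  target_idx, visible_idxs, sigma=1.0):
    """Combine a geometric occupancy prior over a 1-D shelf grid with
    semantic affinity: boost the prior near visible objects that are
    semantically close to the hidden target, then renormalize. This
    mirrors the idea of synthesizing semantics with geometric
    constraints, not the paper's exact formulation."""
    grid = np.arange(len(spatial_prior))
    weights = affinity[target_idx, visible_idxs]
    bump = np.zeros_like(spatial_prior, dtype=float)
    for pos, w in zip(visible_positions, weights):
        bump += w * np.exp(-0.5 * ((grid - pos) / sigma) ** 2)
    combined = spatial_prior * (1.0 + bump)
    return combined / combined.sum()

# Usage: the target "ibuprofen" (index 0) is occluded; "aspirin" (1) is
# visible at cell 2 and "stapler" (2) at cell 8 of a 10-cell shelf with
# a uniform prior. The result concentrates probability near the aspirin.
prior = np.full(10, 0.1)
dist = semantic_spatial_distribution(prior, [2, 8], 0, np.array([1, 2]))
```

With these toy numbers, the distribution peaks near the semantically related visible object (aspirin) rather than the unrelated one (stapler), which is the behavior that lets semantic search reduce the number of objects a robot must move.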

URL

https://arxiv.org/abs/2302.12915

PDF

https://arxiv.org/pdf/2302.12915.pdf
