A System for Generalized 3D Multi-Object Search

2023-03-06 14:47:38
Kaiyu Zheng, Anirudha Paul, Stefanie Tellex

Abstract

Searching for objects is a fundamental skill for robots. As such, we expect object search to eventually become an off-the-shelf capability for robots, similar to, e.g., object detection and SLAM. However, no system for 3D object search exists that generalizes across real robots and environments. In this paper, building upon a recent theoretical framework that exploited the octree structure for representing belief in 3D, we present GenMOS (Generalized Multi-Object Search), the first general-purpose system for multi-object search (MOS) in a 3D region that is robot-independent and environment-agnostic. GenMOS takes as input point cloud observations of the local region, object detection results, and localization of the robot's view pose, and outputs a 6D viewpoint to move to through online planning. In particular, GenMOS uses point cloud observations in three ways: (1) to simulate occlusion; (2) to inform occupancy and initialize octree belief; and (3) to sample a belief-dependent graph of view positions that avoid obstacles. We evaluate our system both in simulation and on two real robot platforms. Our system enables, for example, a Boston Dynamics Spot robot to find a toy cat hidden underneath a couch in under one minute. We further integrate 3D local search with 2D global search to handle larger areas, demonstrating the resulting system in a 25 m$^2$ lobby area.
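
The idea of initializing a 3D belief from point cloud occupancy and then updating it after a view can be illustrated with a minimal sketch. This is not the GenMOS implementation: it uses a flat voxel grid instead of an octree, and every name here (`voxel_of`, `init_belief`, `update_not_detected`, `occupied_weight`, `miss_rate`) is hypothetical. In particular, how strongly occupancy should bias the prior is an assumption exposed as a parameter, and the update assumes a detector with no false positives.

```python
import math


def voxel_of(point, origin, res):
    """Map a 3D point to integer voxel indices for a grid anchored at `origin`."""
    return tuple(int(math.floor((p - o) / res)) for p, o in zip(point, origin))


def init_belief(points, origin, dims, res, occupied_weight=2.0):
    """Build a normalized prior over voxels.

    Voxels that contain at least one point get `occupied_weight`, all others 1.0.
    The weighting scheme is an illustrative assumption, not taken from the paper.
    """
    occupied = {voxel_of(p, origin, res) for p in points}
    weights = {}
    for x in range(dims[0]):
        for y in range(dims[1]):
            for z in range(dims[2]):
                v = (x, y, z)
                weights[v] = occupied_weight if v in occupied else 1.0
    total = sum(weights.values())
    return {v: w / total for v, w in weights.items()}


def update_not_detected(belief, visible, miss_rate=0.1):
    """Bayes update after the target was NOT detected.

    P(no detection | target in voxel) is `miss_rate` for visible voxels and
    1.0 for voxels outside the field of view or occluded.
    """
    post = {v: p * (miss_rate if v in visible else 1.0) for v, p in belief.items()}
    total = sum(post.values())
    return {v: p / total for v, p in post.items()}


# Two points in a 2x2x2 grid of 0.5 m voxels; then one voxel is viewed with no detection.
points = [(0.1, 0.1, 0.1), (0.9, 0.9, 0.9)]
belief = init_belief(points, origin=(0.0, 0.0, 0.0), dims=(2, 2, 2), res=0.5)
belief = update_not_detected(belief, visible={(0, 0, 0)})
```

After the update, probability mass shifts away from the viewed voxel toward the unviewed ones, which is what makes the next viewpoint choice belief-dependent. Under the same assumptions, a simulated occlusion step would amount to removing voxels hidden behind occupied ones from `visible` before the update.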

URL

https://arxiv.org/abs/2303.03178

PDF

https://arxiv.org/pdf/2303.03178.pdf

