How to Tidy Up a Table: Fusing Visual and Semantic Commonsense Reasoning for Robotic Tasks with Vague Objectives

2023-07-21 03:00:31
Yiqing Xu, David Hsu

Abstract

Vague objectives in many real-life scenarios pose long-standing challenges for robotics, as defining rules, rewards, or constraints for optimization is difficult. Tasks like tidying a messy table may appear simple for humans, but articulating the criteria for tidiness is complex due to the ambiguity and flexibility in commonsense reasoning. Recent advances in Large Language Models (LLMs) offer us an opportunity to reason over these vague objectives: learned from extensive human data, LLMs capture meaningful common sense about human behavior. However, as LLMs are trained solely on language input, they may struggle with robotic tasks due to their limited capacity to account for perception and low-level controls. In this work, we propose a simple approach to solve the task of table tidying, an example of robotic tasks with vague objectives. Specifically, the task of tidying a table involves not just clustering objects by type and functionality for semantic tidiness but also considering spatial-visual relations of objects for a visually pleasing arrangement, termed visual tidiness. We propose to learn a lightweight, image-based tidiness score function to ground the semantically tidy policy of LLMs to achieve visual tidiness. We train the tidiness score on synthetic data gathered by random walks from a few tidy configurations. Such trajectories naturally encode the order of tidiness, thereby eliminating the need for laborious and expensive human demonstrations. Our empirical results show that our pipeline can be applied to unseen objects and complex 3D arrangements.
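
The abstract's idea of gathering training data by random walks from tidy configurations can be illustrated with a small sketch. This is not the authors' implementation: the 2D position representation, the function names (`random_walk`, `ranking_pairs`, `pairwise_hinge_loss`), the step size, and the hinge-style ranking loss are all assumptions chosen for illustration. The paper's score function is image-based, whereas this sketch works on raw coordinates, and it treats "earlier in the walk" as "tidier", which holds only approximately for a random walk.

```python
import random


def random_walk(tidy_config, steps, step_size=0.05, seed=0):
    """Perturb a tidy layout one object at a time.

    tidy_config: list of (x, y) object positions (hypothetical representation).
    Returns a list of configurations; index 0 is the tidy starting layout.
    """
    rng = random.Random(seed)
    config = list(tidy_config)
    trajectory = [list(config)]
    for _ in range(steps):
        i = rng.randrange(len(config))  # pick one object to nudge
        x, y = config[i]
        config[i] = (
            x + rng.uniform(-step_size, step_size),
            y + rng.uniform(-step_size, step_size),
        )
        trajectory.append(list(config))
    return trajectory


def ranking_pairs(trajectory):
    """Build (tidier, messier) pairs, assuming earlier steps are tidier."""
    return [
        (trajectory[i], trajectory[j])
        for i in range(len(trajectory))
        for j in range(i + 1, len(trajectory))
    ]


def pairwise_hinge_loss(score_tidier, score_messier, margin=1.0):
    """Assumed ranking loss: zero once the tidier score leads by the margin."""
    return max(0.0, margin - (score_tidier - score_messier))
```

In such a setup, a learned score function would be trained so that, for each pair, the configuration closer to the tidy start receives the higher score; no human labels are needed because the ordering comes from the walk itself.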

Abstract (translated)

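In many real-world scenarios, objectives are vague, which poses a long-standing challenge for robots because defining rules, rewards, or constraints for optimization is very difficult. For example, tidying a messy table may look simple to humans, but describing the criteria for tidiness is complex because of the ambiguity and flexibility of commonsense reasoning. Recently, the development of Large Language Models (LLMs) has given us an opportunity to address this problem: learning from extensive human data, LLMs capture commonsense reasoning that is meaningful for human behavior. However, because LLMs are trained only on language input, they may run into difficulty with robotic tasks, as they lack sufficient capacity to handle perception and low-level control. In this work, we propose a simple method for the table-tidying task, an example of a robotic task with a vague objective. Specifically, tidying a table involves not only grouping objects by type and function to achieve semantic tidiness, but also considering spatial-visual relations to create a visually pleasing arrangement, called visual tidiness. We propose learning a lightweight, image-based tidiness score function to ground the LLM's semantically tidy policy and achieve visual tidiness. We innovatively use synthetic data collected from a few tidy configurations, generating these trajectories by random walks; the trajectories naturally encode the order of tidiness, eliminating the need for laborious human demonstrations. Our experimental results show that our pipeline can be applied to unseen objects and complex 3D arrangements.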

URL

https://arxiv.org/abs/2307.11319

PDF

https://arxiv.org/pdf/2307.11319.pdf

