How to Tidy Up a Table: Fusing Visual and Semantic Commonsense Reasoning for Robotic Tasks with Vague Objectives

Abstract

Vague objectives in many real-life scenarios pose long-standing challenges for robotics, because defining rules, rewards, or constraints for optimization is difficult. Tasks like tidying a messy table may appear simple to humans, but articulating the criteria for tidiness is hard because commonsense reasoning is ambiguous and flexible. Recent advances in Large Language Models (LLMs) offer an opportunity to reason over these vague objectives: having learned from extensive human data, LLMs capture meaningful common sense about human behavior. However, because LLMs are trained solely on language input, they may struggle with robotic tasks due to their limited capacity to account for perception and low-level control. In this work, we propose a simple approach to the task of table tidying, an example of a robotic task with a vague objective. Specifically, tidying a table involves not just clustering objects by type and functionality for semantic tidiness, but also considering the spatial-visual relations of objects to produce a visually pleasing arrangement, termed visual tidiness. We propose to learn a lightweight, image-based tidiness score function that grounds the semantically tidy policy of LLMs so that it also achieves visual tidiness. We train the tidiness score on synthetic data gathered by random walks that start from a few tidy configurations. Such trajectories naturally encode the order of tidiness, thereby eliminating the need for laborious and expensive human demonstrations. Our empirical results show that our pipeline can be applied to unseen objects and complex 3D arrangements.
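
The random-walk data collection described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give the details, and the score in the paper is image-based. Here a layout is assumed to be a list of 2D object positions, and the names `random_walk`, `ordered_pairs`, and `step_size` are hypothetical. The only idea taken from the abstract is that states reached later in a walk away from a tidy configuration can be treated as less tidy than earlier ones, which yields ordering labels without human demonstrations.

```python
import random


def random_walk(tidy_config, num_steps, step_size=0.05, seed=None):
    """Perturb a tidy layout one object at a time.

    Assumption (not from the paper): a layout is a list of (x, y) positions.
    Returns a list of layouts; index 0 is the original tidy layout.
    """
    rng = random.Random(seed)
    layout = [tuple(p) for p in tidy_config]
    trajectory = [layout]
    for _ in range(num_steps):
        # Pick one object and nudge it by a small random offset.
        i = rng.randrange(len(layout))
        x, y = layout[i]
        moved = (
            x + rng.uniform(-step_size, step_size),
            y + rng.uniform(-step_size, step_size),
        )
        layout = layout[:i] + [moved] + layout[i + 1:]
        trajectory.append(layout)
    return trajectory


def ordered_pairs(trajectory):
    """Build (tidier, messier) training pairs from one trajectory.

    Assumption: an earlier step of the walk is treated as tidier than any
    later step. This is a heuristic label, not a guarantee.
    """
    n = len(trajectory)
    return [(trajectory[i], trajectory[j]) for i in range(n) for j in range(i + 1, n)]


if __name__ == "__main__":
    tidy = [(0.0, 0.0), (0.1, 0.0), (0.2, 0.0)]  # three objects in a row
    walk = random_walk(tidy, num_steps=5, seed=0)
    pairs = ordered_pairs(walk)
    print(len(walk), "layouts,", len(pairs), "ordered pairs")
```

Pairs of this kind could feed a ranking-style loss for a score function. In an image-based setting, each layout would presumably be rendered to an image first, a step this sketch leaves out.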

URL

https://arxiv.org/abs/2307.11319

PDF

https://arxiv.org/pdf/2307.11319.pdf
