Abstract
Interacting with real-world cluttered scenes poses several challenges to robotic agents, which need to understand complex spatial dependencies among the observed objects to determine optimal pick sequences or efficient object retrieval strategies. Existing solutions typically handle simplified scenarios and focus on predicting pairwise object relationships after an initial object detection phase, but often overlook the global context or struggle with redundant and missing object relations. In this work, we present a modern take on visual relational reasoning for grasp planning. We introduce D3GD, a novel testbed that includes bin picking scenes with up to 35 objects from 97 distinct categories. Additionally, we propose D3G, a new end-to-end transformer-based dependency graph generation model that simultaneously detects objects and produces an adjacency matrix representing their spatial relationships. Recognizing the limitations of standard metrics, we employ the Average Precision of Relationships for the first time to evaluate model performance, conducting an extensive experimental benchmark. The obtained results establish our approach as the new state-of-the-art for this task, laying the foundation for future research in robotic manipulation. We publicly release the code and dataset at this https URL.
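The abstract does not specify how the predicted adjacency matrix is consumed downstream; as an illustration only, the following minimal Python sketch (not the authors' code) shows how such a matrix could be turned into a valid pick sequence via topological sorting. The convention that A[i][j] = 1 means "object i rests on object j" and the helper name pick_order are assumptions made for this example.

    # Hypothetical illustration (not the paper's implementation): given a predicted
    # dependency adjacency matrix A, where A[i][j] = 1 is assumed to mean that
    # object i lies on top of object j (so j cannot be grasped before i is removed),
    # a valid pick sequence follows from a topological sort of the dependency graph.
    from graphlib import TopologicalSorter  # Python 3.9+

    def pick_order(adjacency: list[list[int]]) -> list[int]:
        """Return object indices in an order that respects spatial dependencies."""
        n = len(adjacency)
        # TopologicalSorter takes {node: predecessors}; object j depends on every
        # object i stacked on it (A[i][j] == 1), which must therefore be picked first.
        graph = {j: {i for i in range(n) if adjacency[i][j]} for j in range(n)}
        return list(TopologicalSorter(graph).static_order())

    # Toy scene: object 0 rests on object 1, and object 1 rests on object 2.
    A = [[0, 1, 0],
         [0, 0, 1],
         [0, 0, 0]]
    print(pick_order(A))  # [0, 1, 2]: the topmost object is picked first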
URL
https://arxiv.org/abs/2409.02035