Abstract
Image-guided object assembly is a burgeoning research topic in computer vision. This paper introduces a novel task: translating multi-view images of a structural 3D model (for example, one constructed from building blocks drawn from a 3D-object library) into a detailed sequence of assembly instructions executable by a robotic arm. Given multi-view images of the target 3D model to replicate, a model for this task must solve several sub-tasks: recognizing the individual components used to construct the 3D model, estimating the geometric pose of each component, and deducing a feasible assembly order that obeys physical constraints. Establishing accurate 2D-3D correspondence between the multi-view images and the 3D objects is technically challenging. To tackle this, we propose an end-to-end model, the Neural Assembler, which learns an object graph in which each vertex represents a component recognized from the images and the edges specify the topology of the 3D model, enabling the derivation of an assembly plan. We establish benchmarks for this task and conduct comprehensive empirical evaluations of the Neural Assembler and alternative solutions. Our experiments clearly demonstrate the superiority of the Neural Assembler.
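The abstract does not specify how an assembly plan is derived from the learned object graph; as a rough illustration only (all names hypothetical, not the paper's method), one standard way to turn pairwise "place A before B" constraints among recognized components into a feasible order is a topological sort:

```python
from collections import deque

def assembly_order(parts, supports):
    """Return a feasible assembly order via Kahn's topological sort.

    parts:    list of part identifiers (hypothetical example inputs).
    supports: list of (a, b) edges meaning part `a` must be placed
              before part `b` (e.g. `a` physically supports `b`).
    """
    indeg = {p: 0 for p in parts}          # number of unplaced prerequisites
    adj = {p: [] for p in parts}           # adjacency list of the object graph
    for a, b in supports:
        adj[a].append(b)
        indeg[b] += 1
    queue = deque(p for p in parts if indeg[p] == 0)  # parts placeable now
    order = []
    while queue:
        p = queue.popleft()
        order.append(p)
        for q in adj[p]:                   # releasing p may unblock successors
            indeg[q] -= 1
            if indeg[q] == 0:
                queue.append(q)
    if len(order) != len(parts):           # leftover parts imply a cycle
        raise ValueError("cyclic support constraints: no feasible order")
    return order
```

For instance, with a base supporting a pillar supporting a roof, the only feasible order is base, pillar, roof; the actual Neural Assembler presumably infers such constraints from the predicted graph rather than taking them as given.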
URL
https://arxiv.org/abs/2404.16423