Paper Reading AI Learner

Neural Assembler: Learning to Generate Fine-Grained Robotic Assembly Instructions from Multi-View Images

2024-04-25 08:53:23
Hongyu Yan, Yadong Mu

Abstract

Image-guided object assembly is a burgeoning research topic in computer vision. This paper introduces a novel task: translating multi-view images of a structural 3D model (for example, one constructed with building blocks drawn from a 3D-object library) into a detailed sequence of assembly instructions executable by a robotic arm. Given multi-view images of the target 3D model to be replicated, a model for this task must address several sub-tasks: recognizing the individual components used to construct the 3D model, estimating the geometric pose of each component, and deducing a feasible assembly order that adheres to physical rules. Establishing accurate 2D-3D correspondence between multi-view images and 3D objects is technically challenging. To tackle this, we propose an end-to-end model, the Neural Assembler. It learns an object graph whose vertices represent components recognized from the images and whose edges specify the topology of the 3D model, enabling the derivation of an assembly plan. We establish benchmarks for this task and conduct comprehensive empirical evaluations of the Neural Assembler and alternative solutions. Our experiments clearly demonstrate the superiority of the Neural Assembler.
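The abstract mentions deriving a feasible assembly order from the learned object graph. The paper does not spell out the ordering procedure here, but one standard way to turn such a graph into a plan is a topological sort over support dependencies. The sketch below is purely illustrative, assuming hypothetical component names and a simple "a must be placed before b" edge convention; it is not the paper's actual algorithm.

```python
from collections import deque

def assembly_order(components, supports):
    """Derive one feasible assembly order from a support graph.

    components: list of component ids (graph vertices).
    supports:   list of (a, b) edges meaning component `a` must be
                placed before `b` (e.g. `a` physically supports `b`).
    Returns a valid order via Kahn's topological sort, or None if
    the graph has a cycle (no physically feasible plan exists).
    """
    indegree = {c: 0 for c in components}
    children = {c: [] for c in components}
    for a, b in supports:
        children[a].append(b)
        indegree[b] += 1
    # Start with components that nothing else must precede.
    queue = deque(c for c in components if indegree[c] == 0)
    order = []
    while queue:
        c = queue.popleft()
        order.append(c)
        for b in children[c]:
            indegree[b] -= 1
            if indegree[b] == 0:
                queue.append(b)
    return order if len(order) == len(components) else None

# Toy model (hypothetical): a base supports two pillars, which support a roof.
parts = ["base", "pillar_L", "pillar_R", "roof"]
edges = [("base", "pillar_L"), ("base", "pillar_R"),
         ("pillar_L", "roof"), ("pillar_R", "roof")]
plan = assembly_order(parts, edges)
```

Any order produced this way places every component only after everything supporting it, which is the physical-feasibility constraint the task imposes.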

URL

https://arxiv.org/abs/2404.16423

PDF

https://arxiv.org/pdf/2404.16423.pdf

