Paper Reading AI Learner

Naturally Supervised 3D Visual Grounding with Language-Regularized Concept Learners

2024-04-30 16:44:18
Chun Feng, Joy Hsu, Weiyu Liu, Jiajun Wu

Abstract

3D visual grounding is a challenging task that often requires direct and dense supervision, notably the semantic label for each object in the scene. In this paper, we instead study the naturally supervised setting that learns from only 3D scene and QA pairs, where prior works underperform. We propose the Language-Regularized Concept Learner (LARC), which uses constraints from language as regularization to significantly improve the accuracy of neuro-symbolic concept learners in the naturally supervised setting. Our approach is based on two core insights: the first is that language constraints (e.g., a word's relation to another) can serve as effective regularization for structured representations in neuro-symbolic models; the second is that we can query large language models to distill such constraints from language properties. We show that LARC improves the performance of prior works in naturally supervised 3D visual grounding, and demonstrates a wide range of 3D visual reasoning capabilities, from zero-shot composition to data efficiency and transferability. Our method represents a promising step towards regularizing structured visual reasoning frameworks with language-based priors, for learning in settings without dense supervision.

Abstract (translated)

3D visual grounding is a challenging task that usually requires direct and dense supervision, in particular a semantic label for each object in the scene. In this paper, we instead study the naturally supervised setting, which learns only from 3D scenes and QA pairs and in which prior works underperform. We propose the Language-Regularized Concept Learner (LARC), which uses language constraints as regularization to significantly improve the accuracy of neuro-symbolic concept learners in the naturally supervised setting. Our approach rests on two core insights: first, language constraints (for example, a word's relation to other words) can serve as effective regularization for structured representations; second, we can query large language models to distill such constraints from language properties. We show that LARC improves the performance of prior works on naturally supervised 3D visual grounding and demonstrates a wide range of 3D visual reasoning capabilities, from zero-shot composition to data efficiency and transferability. Our method is a promising step toward regularizing structured visual reasoning frameworks with language-based priors, for learning in settings without dense supervision.
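To make the core idea concrete, here is a minimal, hypothetical sketch (not the paper's actual implementation) of how a language-derived constraint can act as a regularizer: if language tells us a relational concept such as "next to" is symmetric, we can penalize asymmetry in the model's pairwise relation scores and add that penalty to the task loss. The function names, the score matrix, and the weighting are illustrative assumptions.

```python
def symmetry_regularizer(scores):
    """Penalize asymmetry in pairwise relation scores.

    For a relation that language tells us is symmetric (e.g., "next to"),
    scores[i][j] should equal scores[j][i]; the penalty is the mean
    squared difference over all object pairs. (Illustrative sketch only.)
    """
    n = len(scores)
    return sum((scores[i][j] - scores[j][i]) ** 2
               for i in range(n) for j in range(n)) / (n * n)

# Hypothetical pairwise "next to" scores for three objects in a scene.
scores = [[0.0, 0.9, 0.1],
          [0.2, 0.0, 0.8],
          [0.1, 0.7, 0.0]]

task_loss = 0.5   # placeholder for the QA objective (e.g., cross-entropy)
lam = 0.1         # regularization weight (an assumed hyperparameter)
total_loss = task_loss + lam * symmetry_regularizer(scores)
```

In this toy setup the regularizer is zero exactly when the score matrix is symmetric, so gradient descent on `total_loss` would push the learned relation toward the property distilled from language, without needing per-object semantic labels.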

URL

https://arxiv.org/abs/2404.19696

PDF

https://arxiv.org/pdf/2404.19696.pdf

