Paper Reading AI Learner

Rethinking 3D Dense Caption and Visual Grounding in A Unified Framework through Prompt-based Localization

2024-04-17 04:46:27
Yongdong Luo, Haojia Lin, Xiawu Zheng, Yigeng Jiang, Fei Chao, Jie Hu, Guannan Jiang, Songan Zhang, Rongrong Ji

Abstract

3D Visual Grounding (3DVG) and 3D Dense Captioning (3DDC) are two crucial tasks in various 3D applications, which require both shared and complementary information in localization and visual-language relationships. Therefore, existing approaches adopt the two-stage "detect-then-describe/discriminate" pipeline, which relies heavily on the performance of the detector, resulting in suboptimal performance. Inspired by DETR, we propose a unified framework, 3DGCTR, to jointly solve these two distinct but closely related tasks in an end-to-end fashion. The key idea is to reconsider the prompt-based localization ability of the 3DVG model. In this way, the 3DVG model with a well-designed prompt as input can assist the 3DDC task by extracting localization information from the prompt. In terms of implementation, we integrate a Lightweight Caption Head into the existing 3DVG network with a Caption Text Prompt as a connection, effectively harnessing the existing 3DVG model's inherent localization capacity, thereby boosting 3DDC capability. This integration facilitates simultaneous multi-task training on both tasks, mutually enhancing their performance. Extensive experimental results demonstrate the effectiveness of this approach. Specifically, on the ScanRefer dataset, 3DGCTR surpasses the state-of-the-art 3DDC method by 4.3% in CIDEr@0.5IoU in MLE training and improves upon the SOTA 3DVG method by 3.16% in Acc@0.25IoU.
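The core idea above — one model that either localizes an object from a grounding query or, given a fixed caption text prompt, describes the objects it localizes — can be illustrated with a toy sketch. Everything here is hypothetical (class names, the prompt string, the name-matching "localization"); the actual 3DGCTR uses a transformer-based 3DVG network with learned cross-attention and a lightweight caption head, not string matching.

```python
# Toy sketch of prompt-based unified grounding/captioning.
# All names here are illustrative, not the paper's real interfaces.

CAPTION_PROMPT = "describe the object"  # hypothetical fixed caption text prompt


class Unified3DModel:
    """Stand-in for a unified 3DVG/3DDC model.

    The grounding branch maps (scene, text) -> a localized object;
    the lightweight caption head maps a localized object -> a sentence.
    """

    def __init__(self, scene_objects):
        # scene_objects: object_id -> {"name": ..., "center": (x, y, z)}
        self.scene_objects = scene_objects

    def localize(self, text):
        # Toy stand-in for prompt-based localization: match the object
        # whose name appears in the query (the real model learns this).
        for oid, obj in self.scene_objects.items():
            if obj["name"] in text:
                return oid
        return None

    def caption_head(self, object_id):
        # Toy caption head: emit a templated description of the object.
        obj = self.scene_objects[object_id]
        return f"a {obj['name']} located at {obj['center']}"

    def forward(self, text):
        # The prompt selects the task: the caption prompt triggers dense
        # captioning over every localized object; any other text is
        # treated as a grounding query and returns a localization.
        if text == CAPTION_PROMPT:
            return {oid: self.caption_head(oid) for oid in self.scene_objects}
        oid = self.localize(text)
        return self.scene_objects[oid]["center"] if oid is not None else None


scene = {
    0: {"name": "chair", "center": (1.0, 2.0, 0.5)},
    1: {"name": "table", "center": (0.0, 0.0, 0.4)},
}
model = Unified3DModel(scene)
print(model.forward("the chair near the window"))  # grounding -> (1.0, 2.0, 0.5)
print(model.forward(CAPTION_PROMPT))               # dense captioning -> dict of sentences
```

The point of the sketch is only the control flow: both tasks share one localization pathway, and the caption text prompt is what routes localized objects into the caption head, which is why joint training on both tasks can share (and mutually improve) the localizer.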


URL

https://arxiv.org/abs/2404.11064

PDF

https://arxiv.org/pdf/2404.11064.pdf

