Abstract
3D Visual Grounding (3DVG) and 3D Dense Captioning (3DDC) are two crucial tasks in various 3D applications, both of which require shared and complementary information about object localization and vision-language relationships. However, existing approaches adopt a two-stage "detect-then-describe/discriminate" pipeline, which relies heavily on the quality of the detector and therefore yields suboptimal results. Inspired by DETR, we propose a unified framework, 3DGCTR, to jointly solve these two distinct but closely related tasks in an end-to-end fashion. The key idea is to reconsider the prompt-based localization ability of the 3DVG model: given a well-designed prompt as input, the 3DVG model can assist the 3DDC task by extracting localization information from that prompt. Concretely, we integrate a Lightweight Caption Head into an existing 3DVG network, using a Caption Text Prompt as the connection between the two, which effectively harnesses the 3DVG model's inherent localization capacity and thereby boosts 3DDC performance. This integration enables simultaneous multi-task training on both tasks, with each enhancing the other. Extensive experimental results demonstrate the effectiveness of this approach. Specifically, on the ScanRefer dataset, 3DGCTR surpasses the state-of-the-art 3DDC method by 4.3% in CIDEr@0.5IoU under MLE training and improves upon the state-of-the-art 3DVG method by 3.16% in Acc@0.25IoU.
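The architecture sketched in the abstract can be made concrete with a minimal, self-contained PyTorch toy model. Everything below is an illustrative assumption rather than the authors' implementation: `ToyVGBackbone` stands in for the real DETR-style 3DVG network, the dimensions and module choices (a GRU caption head, 16 object queries) are toy values, and the paper's actual Lightweight Caption Head and Caption Text Prompt design may differ.

```python
# Hypothetical sketch of the unified 3DVG + 3DDC pipeline; not the paper's code.
import torch
import torch.nn as nn

VOCAB_SIZE = 1000   # toy vocabulary size (assumption)
D_MODEL = 256       # toy feature width (assumption)
NUM_QUERIES = 16    # toy number of object queries (assumption)

class ToyVGBackbone(nn.Module):
    """Stand-in for the existing DETR-style 3DVG network: fuses point
    features with text features and emits text-conditioned object queries
    plus box predictions."""
    def __init__(self):
        super().__init__()
        self.point_encoder = nn.Linear(3, D_MODEL)              # toy per-point encoder
        self.text_encoder = nn.Embedding(VOCAB_SIZE, D_MODEL)   # toy word embeddings
        self.fusion = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(D_MODEL, nhead=8, batch_first=True),
            num_layers=2)
        self.queries = nn.Parameter(torch.randn(NUM_QUERIES, D_MODEL))
        self.box_head = nn.Linear(D_MODEL, 6)                   # box center + size

    def forward(self, points, text_tokens):
        # Memory = point features concatenated with prompt/description features.
        memory = torch.cat([self.point_encoder(points),
                            self.text_encoder(text_tokens)], dim=1)
        q = self.queries.unsqueeze(0).expand(points.size(0), -1, -1)
        obj = self.fusion(q, memory)          # queries attend to scene + text
        return obj, self.box_head(obj)

class ThreeDGCTR(nn.Module):
    """Unified model: 3DVG backbone + Lightweight Caption Head, connected
    through the text prompt."""
    def __init__(self):
        super().__init__()
        self.backbone = ToyVGBackbone()
        self.caption_head = nn.GRU(D_MODEL, D_MODEL, batch_first=True)  # lightweight decoder
        self.word_proj = nn.Linear(D_MODEL, VOCAB_SIZE)

    def forward(self, points, text_tokens, caption_in=None):
        # 3DVG branch: `text_tokens` is a referring description and the
        # predicted boxes localize the referred object. 3DDC branch: the
        # same tokens hold a fixed Caption Text Prompt, so the identical
        # localization machinery proposes the objects to be described.
        obj, boxes = self.backbone(points, text_tokens)
        word_logits = None
        if caption_in is not None:            # decode a caption (teacher forcing)
            emb = self.backbone.text_encoder(caption_in)        # shared embeddings
            h0 = obj[:, :1].transpose(0, 1).contiguous()        # seed with top query
            out, _ = self.caption_head(emb, h0)
            word_logits = self.word_proj(out)
        return boxes, word_logits

# Toy usage: two scenes of 1024 points, an 8-token prompt, a 12-token caption.
model = ThreeDGCTR()
pts = torch.randn(2, 1024, 3)
prompt = torch.randint(0, VOCAB_SIZE, (2, 8))
cap_in = torch.randint(0, VOCAB_SIZE, (2, 12))
boxes, word_logits = model(pts, prompt, cap_in)
print(boxes.shape, word_logits.shape)         # (2, 16, 6) (2, 12, 1000)
```

The design point the sketch tries to capture is that both tasks share one text-conditioned localization path; only the input text (referring description vs. fixed caption prompt) and the presence of the caption decoder differ, which is what makes joint multi-task training straightforward.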
URL
https://arxiv.org/abs/2404.11064