Abstract
3D Visual Grounding (3DVG) and 3D Dense Captioning (3DDC) are two crucial tasks in various 3D applications, both of which require shared and complementary information about object localization and vision-language relationships. However, existing approaches adopt a two-stage "detect-then-describe/discriminate" pipeline, which relies heavily on the quality of the detector and therefore yields suboptimal results. Inspired by DETR, we propose a unified framework, 3DGCTR, to jointly solve these two distinct but closely related tasks in an end-to-end fashion. The key idea is to reconsider the prompt-based localization ability of the 3DVG model: given a well-designed prompt as input, the 3DVG model can assist the 3DDC task by extracting localization information from that prompt. Concretely, we integrate a Lightweight Caption Head into an existing 3DVG network, using a Caption Text Prompt as the connection between the two, which effectively harnesses the 3DVG model's inherent localization capacity and thereby boosts 3DDC performance. This integration enables simultaneous multi-task training on both tasks, with each enhancing the other. Extensive experimental results demonstrate the effectiveness of this approach. Specifically, on the ScanRefer dataset, 3DGCTR surpasses the state-of-the-art 3DDC method by 4.3% in CIDEr@0.5IoU under MLE training and improves upon the state-of-the-art 3DVG method by 3.16% in Acc@0.25IoU.
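The architecture sketched in the abstract can be made concrete with a minimal, self-contained PyTorch toy model. Everything below is an illustrative assumption rather than the authors' implementation: `ToyVGBackbone` stands in for the real DETR-style 3DVG network, the dimensions and module choices (a GRU caption head, 16 object queries) are toy values, and the paper's actual Lightweight Caption Head and Caption Text Prompt design may differ.

```python
# Hypothetical sketch of the unified 3DVG + 3DDC pipeline; not the paper's code.
import torch
import torch.nn as nn

VOCAB_SIZE = 1000   # toy vocabulary size (assumption)
D_MODEL = 256       # toy feature width (assumption)
NUM_QUERIES = 16    # toy number of object queries (assumption)

class ToyVGBackbone(nn.Module):
    """Stand-in for the existing DETR-style 3DVG network: fuses point
    features with text features and emits text-conditioned object queries
    plus box predictions."""
    def __init__(self):
        super().__init__()
        self.point_encoder = nn.Linear(3, D_MODEL)              # toy per-point encoder
        self.text_encoder = nn.Embedding(VOCAB_SIZE, D_MODEL)   # toy word embeddings
        self.fusion = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(D_MODEL, nhead=8, batch_first=True),
            num_layers=2)
        self.queries = nn.Parameter(torch.randn(NUM_QUERIES, D_MODEL))
        self.box_head = nn.Linear(D_MODEL, 6)                   # box center + size

    def forward(self, points, text_tokens):
        # Memory = point features concatenated with prompt/description features.
        memory = torch.cat([self.point_encoder(points),
                            self.text_encoder(text_tokens)], dim=1)
        q = self.queries.unsqueeze(0).expand(points.size(0), -1, -1)
        obj = self.fusion(q, memory)          # queries attend to scene + text
        return obj, self.box_head(obj)

class ThreeDGCTR(nn.Module):
    """Unified model: 3DVG backbone + Lightweight Caption Head, connected
    through the text prompt."""
    def __init__(self):
        super().__init__()
        self.backbone = ToyVGBackbone()
        self.caption_head = nn.GRU(D_MODEL, D_MODEL, batch_first=True)  # lightweight decoder
        self.word_proj = nn.Linear(D_MODEL, VOCAB_SIZE)

    def forward(self, points, text_tokens, caption_in=None):
        # 3DVG branch: `text_tokens` is a referring description and the
        # predicted boxes localize the referred object. 3DDC branch: the
        # same tokens hold a fixed Caption Text Prompt, so the identical
        # localization machinery proposes the objects to be described.
        obj, boxes = self.backbone(points, text_tokens)
        word_logits = None
        if caption_in is not None:            # decode a caption (teacher forcing)
            emb = self.backbone.text_encoder(caption_in)        # shared embeddings
            h0 = obj[:, :1].transpose(0, 1).contiguous()        # seed with top query
            out, _ = self.caption_head(emb, h0)
            word_logits = self.word_proj(out)
        return boxes, word_logits

# Toy usage: two scenes of 1024 points, an 8-token prompt, a 12-token caption.
model = ThreeDGCTR()
pts = torch.randn(2, 1024, 3)
prompt = torch.randint(0, VOCAB_SIZE, (2, 8))
cap_in = torch.randint(0, VOCAB_SIZE, (2, 12))
boxes, word_logits = model(pts, prompt, cap_in)
print(boxes.shape, word_logits.shape)         # (2, 16, 6) (2, 12, 1000)
```

The design point the sketch tries to capture is that both tasks share one text-conditioned localization path; only the input text (referring description vs. fixed caption prompt) and the presence of the caption decoder differ, which is what makes joint multi-task training straightforward.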
URL
https://arxiv.org/abs/2404.11064