Abstract
We address the problem of Embodied Reference Understanding, which involves predicting the object that a person in the scene is referring to through both a pointing gesture and language. Accurately identifying the referent requires multimodal understanding that integrates textual instructions, visual pointing, and scene context. However, existing methods often struggle to leverage visual cues effectively for disambiguation. We also observe that, while the referent is often aligned with the head-to-fingertip line, it occasionally aligns more closely with the wrist-to-fingertip line. Relying on a single line assumption can therefore be overly simplistic and may lead to suboptimal performance. To address this, we propose a dual-model framework, where one model learns from the head-to-fingertip direction and the other from the wrist-to-fingertip direction. We further introduce a Gaussian ray heatmap representation of these lines and use the heatmaps as input to provide a strong supervisory signal that encourages the model to better attend to pointing cues. To combine the strengths of both models, we present the CLIP-Aware Pointing Ensemble module, which performs a hybrid ensemble based on CLIP features. Additionally, we propose an object center prediction head as an auxiliary task to further enhance referent localization. We validate our approach through extensive experiments and analysis on the benchmark YouRefIt dataset, achieving an improvement of approximately 4 mAP at the 0.25 IoU threshold.
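As an illustration of the Gaussian ray heatmap idea, the sketch below renders a 2D heatmap whose intensity decays with a pixel's distance to a pointing ray. The abstract does not give the exact formulation, so everything here is an assumption: the function name `gaussian_ray_heatmap`, the `sigma` parameter, the choice to start the ray at the origin keypoint (head or wrist) and extend it through the fingertip to the image border, and the use of an isotropic Gaussian on perpendicular distance.

```python
import numpy as np


def gaussian_ray_heatmap(origin, fingertip, height, width, sigma=10.0):
    """Render a heatmap that peaks along the ray origin -> fingertip.

    Hypothetical helper, not the paper's implementation.
    origin, fingertip: (x, y) pixel coordinates of the two keypoints.
    sigma: assumed Gaussian width in pixels.
    """
    origin = np.asarray(origin, dtype=np.float64)
    fingertip = np.asarray(fingertip, dtype=np.float64)

    direction = fingertip - origin
    norm = np.linalg.norm(direction)
    if norm == 0:
        raise ValueError("origin and fingertip must be distinct points")
    direction /= norm  # unit vector along the pointing ray

    # Pixel grid in (x, y) order, shape (H, W, 2).
    ys, xs = np.mgrid[0:height, 0:width]
    pixels = np.stack([xs, ys], axis=-1).astype(np.float64)

    # Project each pixel onto the ray; clamp so that points behind the
    # origin measure their distance to the origin itself (ray, not line).
    t = np.clip((pixels - origin) @ direction, 0.0, None)
    closest = origin + t[..., None] * direction

    dist_sq = np.sum((pixels - closest) ** 2, axis=-1)
    return np.exp(-dist_sq / (2.0 * sigma ** 2))


# One heatmap per pointing hypothesis; keypoints here are made-up values.
head, wrist, tip = (120, 40), (150, 110), (170, 130)
head_ray = gaussian_ray_heatmap(head, tip, height=240, width=320)
wrist_ray = gaussian_ray_heatmap(wrist, tip, height=240, width=320)
```

In a dual-model setup like the one described, each heatmap could be supplied to its own model as an extra input channel alongside the image, though how the inputs are actually fused is not stated in the abstract.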
URL
https://arxiv.org/abs/2507.21888