Abstract
Iconicity, the resemblance between linguistic form and meaning, is pervasive in signed languages, offering a natural testbed for visual grounding. For vision-language models (VLMs), the challenge is to recover such essential mappings from dynamic human motion rather than static context. We introduce the \textit{Visual Iconicity Challenge}, a novel video-based benchmark that adapts psycholinguistic measures to evaluate VLMs on three tasks: (i) phonological sign-form prediction (e.g., handshape, location), (ii) transparency (inferring meaning from visual form), and (iii) graded iconicity ratings. We assess $13$ state-of-the-art VLMs in zero- and few-shot settings on Sign Language of the Netherlands and compare them to human baselines. On \textit{phonological form prediction}, VLMs recover some handshape and location detail but remain below human performance; on \textit{transparency}, they are far from human baselines; and only top models correlate moderately with human \textit{iconicity ratings}. Interestingly, \textit{models with stronger phonological form prediction correlate better with human iconicity judgments}, indicating shared sensitivity to visually grounded structure. Our findings validate these diagnostic tasks and motivate human-centric signals and embodied learning methods for modelling iconicity and improving visual grounding in multimodal models.
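The graded iconicity-rating task implies comparing model-produced ratings against human ratings per sign. The paper does not release evaluation code here, so the following is only a minimal sketch of such a comparison using Spearman rank correlation; the file names and JSON layout are illustrative assumptions, not the authors' format.

    # Hypothetical sketch: correlating model iconicity ratings with human ratings.
    # File names and the {"sign_id": rating} JSON layout are assumed for illustration.
    import json
    from scipy.stats import spearmanr

    def load_ratings(path):
        # Expects a JSON object mapping sign identifiers to numeric ratings.
        with open(path) as f:
            return json.load(f)

    human = load_ratings("human_iconicity_ratings.json")   # assumed file
    model = load_ratings("model_iconicity_ratings.json")   # assumed file

    # Align on signs rated by both humans and the model, then rank-correlate.
    shared = sorted(set(human) & set(model))
    rho, p_value = spearmanr([human[s] for s in shared],
                             [model[s] for s in shared])
    print(f"Spearman rho = {rho:.2f} (p = {p_value:.3g}) over {len(shared)} signs")

A rank correlation is a natural fit here because iconicity ratings are ordinal judgments; a moderate rho for only the strongest models would match the pattern the abstract reports.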
URL
https://arxiv.org/abs/2510.08482