Abstract
Understanding videos inherently requires reasoning over both visual and auditory information. To properly evaluate Omni-Large Language Models (Omni-LLMs), which are capable of processing multi-modal information including vision and audio, an effective benchmark must comprehensively cover three key aspects: (1) multi-modal dependency (i.e., questions that cannot be answered using vision or audio alone), (2) diverse audio information types (e.g., speech, sound events), and (3) varying scene spans. However, existing datasets fall short in one or more of these dimensions, limiting strict and comprehensive evaluation. To address this gap, we introduce JointAVBench, a novel benchmark with strict audio-video correlation, spanning five cognitive dimensions, four audio information types (speech, sound events, music, vocal traits), and three scene spans (single-, cross-, and full-scene). Given the high cost of manual annotation, we propose an automated pipeline that leverages state-of-the-art vision-LLMs, audio-LLMs, and general-purpose LLMs to synthesize questions and answers that strictly require joint audio-visual understanding. We evaluate leading vision-only, audio-only, and Omni-LLMs on our dataset. Results show that even the best-performing Omni-LLM achieves an average accuracy of only 62.6%, outperforming uni-modal baselines but revealing substantial room for improvement, especially in cross-scene reasoning.
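A minimal sketch of what such an automated QA-synthesis pipeline could look like, under assumptions not spelled out in the abstract: the function names (caption_scene, describe_audio, generate_qa, answerable_unimodally) and the filtering step are hypothetical stand-ins for the vision-LLM, audio-LLM, and general-purpose LLM calls, not the paper's actual prompts or models.

```python
# Hypothetical QA-synthesis sketch; all model calls are stubbed out.
from dataclasses import dataclass


@dataclass
class SceneQA:
    question: str
    answer: str
    scene_span: str  # "single", "cross", or "full"


def caption_scene(video_clip: str) -> str:
    """Stub for a vision-LLM call that returns a visual description."""
    return f"visual description of {video_clip}"


def describe_audio(audio_clip: str) -> str:
    """Stub for an audio-LLM call covering speech, sound events, music, vocal traits."""
    return f"audio description of {audio_clip}"


def generate_qa(visual_desc: str, audio_desc: str, span: str) -> SceneQA:
    """Stub for a general-purpose LLM that writes a question whose answer
    requires combining the visual and the audio descriptions."""
    return SceneQA(question="...", answer="...", scene_span=span)


def answerable_unimodally(qa: SceneQA, visual_desc: str, audio_desc: str) -> bool:
    """Stub filter: try to answer from each modality alone and flag the item
    if a single modality suffices, so it can be discarded."""
    return False


def synthesize(video_clip: str, audio_clip: str, span: str = "single") -> SceneQA | None:
    visual_desc = caption_scene(video_clip)
    audio_desc = describe_audio(audio_clip)
    qa = generate_qa(visual_desc, audio_desc, span)
    # Keep only items with strict audio-visual dependency.
    return None if answerable_unimodally(qa, visual_desc, audio_desc) else qa


if __name__ == "__main__":
    print(synthesize("scene_001.mp4", "scene_001.wav"))
```

The key design point this sketch illustrates is the rejection filter: a generated item is kept only if neither modality alone can answer it, which is what enforces the strict audio-video correlation described above.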
URL
https://arxiv.org/abs/2512.12772