Paper Reading AI Learner

JointAVBench: A Benchmark for Joint Audio-Visual Reasoning Evaluation

2025-12-14 17:23:21
Jianghan Chao, Jianzhang Gao, Wenhui Tan, Yuchong Sun, Ruihua Song, Liyun Ru

Abstract

Understanding videos inherently requires reasoning over both visual and auditory information. To properly evaluate Omni-Large Language Models (Omni-LLMs), which are capable of processing multi-modal information including vision and audio, an effective benchmark must comprehensively cover three key aspects: (1) multi-modal dependency (i.e., questions that cannot be answered using vision or audio alone), (2) diverse audio information types (e.g., speech, sound events), and (3) varying scene spans. However, existing datasets fall short in one or more of these dimensions, limiting strict and comprehensive evaluation. To address this gap, we introduce JointAVBench, a novel benchmark with strict audio-video correlation, spanning five cognitive dimensions, four audio information types (speech, sound events, music, vocal traits), and three scene spans (single-, cross-, and full-scene). Given the high cost of manual annotation, we propose an automated pipeline that leverages state-of-the-art vision-LLMs, audio-LLMs, and general-purpose LLMs to synthesize questions and answers that strictly require joint audio-visual understanding. We evaluate leading vision-only, audio-only, and Omni-LLMs on our dataset. Results show that even the best-performing Omni-LLM achieves an average accuracy of only 62.6%, outperforming uni-modal baselines but revealing substantial room for improvement, especially in cross-scene reasoning.
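The headline numbers above imply per-category scoring: each question carries labels such as scene span or audio type, and accuracy is reported per group. A minimal sketch of that bookkeeping is shown below; the field names (`scene_span`, `prediction`, `answer`) are illustrative assumptions, not the paper's actual data format.

```python
from collections import defaultdict

def accuracy_by_category(results, key):
    """Group graded QA results by a category field (e.g. scene span or
    audio type) and compute the accuracy within each group."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in results:
        total[r[key]] += 1
        correct[r[key]] += int(r["prediction"] == r["answer"])
    return {k: correct[k] / total[k] for k in total}

# Toy graded outputs; values are made up for illustration only.
results = [
    {"scene_span": "single", "prediction": "A", "answer": "A"},
    {"scene_span": "single", "prediction": "B", "answer": "A"},
    {"scene_span": "cross",  "prediction": "C", "answer": "C"},
    {"scene_span": "full",   "prediction": "D", "answer": "B"},
]
acc = accuracy_by_category(results, "scene_span")
```

Averaging such per-category scores (rather than pooling all questions) is one common way a benchmark can keep hard, sparsely represented categories like cross-scene reasoning visible in the headline number.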

Abstract (translated)

Understanding videos inherently requires reasoning over both visual and auditory information. To comprehensively evaluate Omni-Large Language Models (Omni-LLMs), which can process multi-modal information including vision and audio, an effective benchmark must cover three aspects: (1) multi-modal dependency (i.e., questions that cannot be answered from vision or audio alone); (2) diverse audio information types (e.g., speech, sound events); and (3) varying scene spans. However, existing datasets fall short in one or more of these dimensions, limiting strict and comprehensive evaluation. To fill this gap, we introduce a new benchmark, JointAVBench, which enforces strict audio-video correlation and covers five cognitive dimensions, four audio information types (speech, sound events, music, and vocal traits), and three scene spans (single-, cross-, and full-scene). Given the high cost of manual annotation, we propose an automated pipeline that leverages state-of-the-art vision-LLMs, audio-LLMs, and general-purpose LLMs to synthesize questions and answers that strictly require joint audio-visual understanding. We evaluate vision-only models, audio-only models, and Omni-LLMs on this dataset. Results show that even the best-performing Omni-LLM reaches an average accuracy of only 62.6%; while this outperforms uni-modal baselines, it reveals substantial room for improvement, especially in cross-scene reasoning.

URL

https://arxiv.org/abs/2512.12772

PDF

https://arxiv.org/pdf/2512.12772.pdf
