Unbiasing through Textual Descriptions: Mitigating Representation Bias in Video Benchmarks

2025-03-24 13:00:25
Nina Shvetsova, Arsha Nagrani, Bernt Schiele, Hilde Kuehne, Christian Rupprecht

Abstract

We propose a new "Unbiased through Textual Description (UTD)" video benchmark based on unbiased subsets of existing video classification and retrieval datasets to enable a more robust assessment of video understanding capabilities. Namely, we tackle the problem that current video benchmarks may suffer from different representation biases, e.g., object bias or single-frame bias, where mere recognition of objects or the use of only a single frame is sufficient for correct prediction. We leverage vision-language models (VLMs) and large language models (LLMs) to analyze and debias benchmarks from such representation biases. Specifically, we generate frame-wise textual descriptions of videos, filter them for specific information (e.g., only objects), and leverage them to examine representation biases across three dimensions: 1) concept bias - determining whether a specific concept (e.g., objects) alone suffices for prediction; 2) temporal bias - assessing whether temporal information contributes to prediction; and 3) common sense vs. dataset bias - evaluating whether zero-shot reasoning or dataset correlations contribute to prediction. We conduct a systematic analysis of 12 popular video classification and retrieval datasets and create new object-debiased test splits for these datasets. Moreover, we benchmark 30 state-of-the-art video models on the original and debiased splits and analyze biases in the models. To facilitate the future development of more robust video understanding benchmarks and models, we release "UTD-descriptions", a dataset with our rich structured descriptions for each dataset, and "UTD-splits", a dataset of object-debiased test splits.
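
To make the described pipeline concrete, below is a minimal sketch of the concept-bias probe for a classification dataset. It is a sketch under stated assumptions, not the authors' implementation: caption_frames and filter_to_objects are hypothetical stand-ins for the VLM and LLM stages, and the sentence-transformers encoder is an arbitrary choice of text model.

    # Minimal sketch of the concept-bias probe -- NOT the paper's code.
    # caption_frames() and filter_to_objects() are hypothetical stand-ins
    # for the VLM and LLM stages; the text encoder is likewise assumed.
    from sentence_transformers import SentenceTransformer, util

    encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed text encoder

    def caption_frames(video_path: str) -> list[str]:
        """Hypothetical VLM stage: one textual description per sampled frame."""
        raise NotImplementedError("plug in a captioning VLM here")

    def filter_to_objects(description: str) -> str:
        """Hypothetical LLM stage: keep only the object mentions of a description."""
        raise NotImplementedError("plug in an LLM prompt here")

    def objects_alone_suffice(video_path: str,
                              class_names: list[str],
                              label: int) -> bool:
        """True if object-only text already ranks the correct class first."""
        descriptions = caption_frames(video_path)
        objects_only = " ".join(filter_to_objects(d) for d in descriptions)
        video_emb = encoder.encode([objects_only])
        class_embs = encoder.encode(class_names)
        scores = util.cos_sim(video_emb, class_embs)  # shape (1, num_classes)
        return scores.argmax().item() == label

Videos for which such a probe succeeds are the object-biased candidates: object recognition alone predicts the label, so an object-debiased split would exclude them. A similar probe could target temporal bias, for instance by comparing predictions from ordered versus shuffled frame descriptions.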

URL

https://arxiv.org/abs/2503.18637

PDF

https://arxiv.org/pdf/2503.18637.pdf

