Why Can't I Open My Drawer? Mitigating Object-Driven Shortcuts in Zero-Shot Compositional Action Recognition

2026-01-22 18:59:13
Geo Ahn, Inwoong Lee, Taeoh Kim, Minho Shim, Dongyoon Wee, Jinwoo Choi

Abstract

We study Compositional Video Understanding (CVU), where models must recognize verbs and objects and compose them to generalize to unseen combinations. We find that existing Zero-Shot Compositional Action Recognition (ZS-CAR) models fail primarily due to an overlooked failure mode: object-driven verb shortcuts. Through systematic analysis, we show that this behavior arises from two intertwined factors: severe sparsity and skewness of compositional supervision, and the asymmetric learning difficulty between verbs and objects. As training progresses, the existing ZS-CAR model increasingly ignores visual evidence and overfits to co-occurrence statistics. Consequently, the existing model does not gain the benefit of compositional recognition in unseen verb-object compositions. To address this, we propose RCORE, a simple and effective framework that enforces temporally grounded verb learning. RCORE introduces (i) a composition-aware augmentation that diversifies verb-object combinations without corrupting motion cues, and (ii) a temporal order regularization loss that penalizes shortcut behaviors by explicitly modeling temporal structure. Across two benchmarks, Sth-com and our newly constructed EK100-com, RCORE significantly improves unseen composition accuracy, reduces reliance on co-occurrence bias, and achieves consistently positive compositional gaps. Our findings reveal object-driven shortcuts as a critical limiting factor in ZS-CAR and demonstrate that addressing them is essential for robust compositional video understanding.
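A toy sketch may help make the "object-driven verb shortcut" concrete. It is not the paper's model or the RCORE method; the label pairs, function names, and counts below are hypothetical and chosen only for illustration. A predictor that picks the verb seen most often with an object in training, and ignores motion evidence entirely, gets an unseen verb-object composition wrong:

```python
from collections import Counter, defaultdict

def fit_cooccurrence(train_pairs):
    """Count how often each verb appears with each object in the training labels."""
    counts = defaultdict(Counter)
    for verb, obj in train_pairs:
        counts[obj][verb] += 1
    return counts

def shortcut_verb(counts, obj):
    """Predict the verb from the object alone, ignoring any visual/motion evidence."""
    return counts[obj].most_common(1)[0][0]

# Hypothetical, skewed training labels as (verb, object) pairs.
# ("open", "drawer") never occurs, so it is an unseen composition.
train_pairs = (
    [("open", "door")] * 8
    + [("close", "door")] * 2
    + [("close", "drawer")] * 9
    + [("open", "box")] * 3
)

counts = fit_cooccurrence(train_pairs)

# Test-time clip (assumed): the video actually shows "open" applied to "drawer".
true_verb, obj = "open", "drawer"
predicted = shortcut_verb(counts, obj)

# The shortcut falls back on the training co-occurrence ("close" + "drawer").
shortcut_is_wrong = predicted != true_verb
print(predicted, shortcut_is_wrong)
```

In this toy setting the object alone determines the predicted verb, which is the kind of reliance on co-occurrence statistics the abstract describes; a model grounded in temporal evidence would need to use the motion in the clip instead.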

URL

https://arxiv.org/abs/2601.16211

PDF

https://arxiv.org/pdf/2601.16211.pdf

