Abstract
Despite recent advances that have pushed video action recognition to strong performance on existing benchmarks, these models often lack robustness when faced with natural distribution shifts between training and test data. We propose two novel evaluation methods to assess model resilience to such distribution disparity. The first uses two datasets collected from different sources, one for training and validation and the other for testing. Specifically, we create dataset splits of HMDB-51 or UCF-101 for training and use Kinetics-400 for testing, restricted to the subset of classes that overlap between the training and test datasets. The second extracts the feature mean of each class (i.e., the class prototype) from the target evaluation dataset's training data and predicts each test video's label from the cosine similarity between its features and the class prototype of each target class. This procedure does not alter model weights using the target dataset and does not require aligning overlapping classes across two different datasets, so it is an efficient way to test model robustness to distribution shifts without prior knowledge of the target distribution. We address the robustness problem through adversarial augmentation training, generating augmented views of videos that are "hard" for the classification model by applying gradient ascent on the augmentation parameters, and through "curriculum" scheduling of the augmentation strength. We experimentally demonstrate the superior performance of the proposed adversarial augmentation approach over baselines across three state-of-the-art action recognition models: TSM, Video Swin Transformer, and Uniformer. This work provides critical insight into model robustness to distribution shifts and presents effective techniques for enhancing video action recognition performance in real-world deployments.
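The class-prototype evaluation described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it assumes features have already been extracted by a frozen action recognition backbone, builds each class prototype as the mean feature of that class in the target dataset's training split, and classifies each test video by cosine similarity to the prototypes.

```python
import numpy as np

def class_prototypes(features, labels, num_classes):
    # Mean feature vector per class from the target dataset's training split,
    # L2-normalized so that a dot product equals cosine similarity.
    protos = np.stack([features[labels == c].mean(axis=0)
                       for c in range(num_classes)])
    return protos / np.linalg.norm(protos, axis=1, keepdims=True)

def prototype_predict(test_features, prototypes):
    # Cosine similarity between each test feature and every class prototype;
    # the most similar prototype gives the predicted class. No model weights
    # are updated on the target dataset.
    test = test_features / np.linalg.norm(test_features, axis=1, keepdims=True)
    sims = test @ prototypes.T
    return sims.argmax(axis=1)
```

Because only feature means are computed, this evaluation needs a single pass over the target training data and no fine-tuning or class-name alignment between source and target label sets.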
Abstract (translated)
尽管视频动作识别的最新进展在现有基准测试上取得了强劲的性能,但当训练和测试数据之间存在自然分布偏移时,这些模型往往缺乏鲁棒性。我们提出了两种新的评估方法来衡量模型对这种分布差异的适应能力。第一种方法使用来自不同来源的两个数据集,一个用于训练和验证,另一个用于测试。具体而言,我们为 HMDB-51 或 UCF-101 创建数据集划分用于训练,并使用 Kinetics-400 进行测试,只保留训练和测试数据集中重叠的类别子集。第二种方法从目标评估数据集的训练数据中提取每个类别的特征均值(即类别原型),并以测试视频特征与各目标类别原型之间的余弦相似度作为预测分数。该过程不利用目标数据集修改模型权重,也不需要对齐两个不同数据集中的重叠类别,因此是一种无需目标分布先验知识即可高效测试模型分布偏移鲁棒性的方法。我们通过对抗式增强训练来解决鲁棒性问题:对增强参数施加梯度上升,生成对分类模型而言"困难"的视频增强视图,并以"课程式"方式调度视频增强的强度。我们通过实验证明了所提出的对抗式增强方法在三个最先进的动作识别模型(TSM、Video Swin Transformer 和 Uniformer)上优于基线方法。本研究为模型对分布偏移的鲁棒性提供了关键洞察,并为在真实部署场景中提升视频动作识别性能提供了有效的技术。
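The core idea of adversarial augmentation training, gradient ascent on augmentation parameters to make the augmented view "hard" for the model, can be illustrated with a toy example. Everything here is an assumption for illustration: the model is a linear softmax classifier standing in for an action recognition network, and the augmentation family is a single scalar brightness parameter `s`; the paper's actual models and augmentation space are far richer.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def adversarial_brightness(W, x, y, s=1.0, lr=0.5, steps=5):
    # Toy adversarial augmentation: ascend the cross-entropy loss of a
    # linear softmax classifier W with respect to a scalar brightness
    # parameter s, so the augmented view s * x becomes a "hard" example.
    for _ in range(steps):
        logits = W @ (s * x)
        p = softmax(logits)
        grad_logits = p.copy()
        grad_logits[y] -= 1.0          # d(cross-entropy) / d(logits)
        ds = grad_logits @ (W @ x)     # chain rule: logits = s * (W @ x)
        s += lr * ds                   # gradient ASCENT on the loss
    return s
```

In practice the augmentation parameters would be vector-valued (crop, color jitter, etc.) and differentiated with an autograd framework, and a curriculum schedule would gradually raise the allowed augmentation strength over training.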
URL
https://arxiv.org/abs/2401.11406