Adversarial Augmentation Training Makes Action Recognition Models More Robust to Realistic Video Distribution Shifts

2024-01-21 05:50:39
Kiyoon Kim, Shreyank N Gowda, Panagiotis Eustratiadis, Antreas Antoniou, Robert B Fisher

Abstract

Despite recent advances in video action recognition achieving strong performance on existing benchmarks, these models often lack robustness when faced with natural distribution shifts between training and test data. We propose two novel evaluation methods to assess model resilience to such distribution disparity. One uses two datasets collected from different sources, one for training and validation and the other for testing. Specifically, we created dataset splits of HMDB-51 or UCF-101 for training, and Kinetics-400 for testing, using the subset of classes that overlap between the train and test datasets. The other method extracts the feature mean of each class from the target evaluation dataset's training data (i.e. a class prototype) and predicts the class of each test video by the cosine similarity between its features and each class prototype. This procedure does not alter the model weights using the target dataset, nor does it require aligning overlapping classes across two datasets; it is therefore an efficient way to test model robustness to distribution shifts without prior knowledge of the target distribution. We address the robustness problem by adversarial augmentation training: generating augmented views of videos that are "hard" for the classification model by applying gradient ascent on the augmentation parameters, combined with "curriculum" scheduling of the augmentation strength. We experimentally demonstrate the superior performance of the proposed adversarial augmentation approach over baselines across three state-of-the-art action recognition models: TSM, Video Swin Transformer, and Uniformer. This work provides critical insight into model robustness to distribution shifts and presents effective techniques to enhance video action recognition performance in real-world deployments.
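
The prototype-based evaluation described above lends itself to a compact implementation. The following is a minimal PyTorch sketch, not the authors' code: it assumes `model(x)` returns an (N, feat_dim) tensor of per-video features and that `train_loader` yields (videos, labels) batches from the target dataset's training split; all function and variable names are illustrative.

```python
import torch
import torch.nn.functional as F

def build_class_prototypes(model, train_loader, num_classes, feat_dim, device="cuda"):
    """Average the features of each class's training videos (class prototypes)."""
    sums = torch.zeros(num_classes, feat_dim, device=device)
    counts = torch.zeros(num_classes, device=device)
    model.eval()
    with torch.no_grad():
        for videos, labels in train_loader:
            labels = labels.to(device)
            feats = model(videos.to(device))              # (N, feat_dim)
            sums.index_add_(0, labels, feats)             # per-class feature sums
            counts += torch.bincount(labels, minlength=num_classes)
    counts.clamp_(min=1)                                  # guard against empty classes
    return sums / counts.unsqueeze(1)                     # (num_classes, feat_dim)

def prototype_predict(model, videos, prototypes):
    """Classify each video by cosine similarity to the class prototypes."""
    with torch.no_grad():
        feats = F.normalize(model(videos), dim=1)         # (N, feat_dim), unit norm
        protos = F.normalize(prototypes, dim=1)           # (C, feat_dim), unit norm
        scores = feats @ protos.t()                       # cosine similarity matrix
    return scores.argmax(dim=1)
```

Because nothing here updates the model, the same frozen backbone can be scored against any target dataset by recomputing the prototypes alone.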
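
The adversarial augmentation idea can likewise be sketched in a few lines. The abstract does not specify the augmentation family, so this stand-in uses a differentiable brightness/contrast jitter; the names (`adversarial_augment`, `strength`, `steps`, `lr`) are illustrative assumptions, not the authors' API.

```python
import torch
import torch.nn.functional as F

def adversarial_augment(model, videos, labels, strength, steps=1, lr=0.1):
    """Gradient *ascent* on augmentation parameters to produce "hard" views.

    `strength` caps the jitter magnitude; a curriculum can grow it over
    epochs. Videos are (N, C, T, H, W) in [0, 1].
    """
    n = videos.size(0)
    brightness = torch.zeros(n, 1, 1, 1, 1, device=videos.device, requires_grad=True)
    contrast = torch.ones(n, 1, 1, 1, 1, device=videos.device, requires_grad=True)
    for _ in range(steps):
        augmented = (videos * contrast + brightness).clamp(0, 1)
        loss = F.cross_entropy(model(augmented), labels)
        g_b, g_c = torch.autograd.grad(loss, [brightness, contrast])
        with torch.no_grad():
            brightness += lr * g_b.sign()                 # ascend: make the view harder
            contrast += lr * g_c.sign()
            brightness.clamp_(-strength, strength)        # curriculum-controlled bound
            contrast.clamp_(1 - strength, 1 + strength)
    with torch.no_grad():
        return (videos * contrast + brightness).clamp(0, 1)

# Curriculum scheduling (sketch): ramp the augmentation strength over training.
# for epoch in range(num_epochs):
#     strength = max_strength * min(1.0, epoch / warmup_epochs)
#     for videos, labels in train_loader:
#         hard_views = adversarial_augment(model, videos, labels, strength)
#         loss = F.cross_entropy(model(hard_views), labels)
#         loss.backward(); optimizer.step(); optimizer.zero_grad()
```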

URL

https://arxiv.org/abs/2401.11406

PDF

https://arxiv.org/pdf/2401.11406.pdf

