Paper Reading AI Learner

Distilling Knowledge from Language Models for Video-based Action Anticipation

2022-10-12 08:02:11
Sayontan Ghosh, Tanvi Aggarwal, Minh Hoai, Niranjan Balasubramanian

Abstract

Anticipating future actions in a video is useful for many autonomous and assistive technologies. Prior action anticipation work mostly treats this as a vision modality problem, where the models learn the task information primarily from the video features in the target action anticipation datasets. In this work, we propose a method to make use of the text-modality that is available during the training, to bring in complementary information that is not present in the target action anticipation datasets. In particular, we leverage pre-trained language models to build a text-modality teacher that is able to predict future actions based on text labels of the past actions extracted from the input video. To further adapt the teacher to the target domain (cooking), we also pretrain the teacher on textual instructions from a recipes dataset (Recipe1M). Then, we distill the knowledge gained by the text-modality teacher into a vision-modality student to further improve it's performance. We empirically evaluate this simple cross-modal distillation strategy on two video datasets EGTEA-GAZE+ and EPIC-KITCHEN 55. Distilling this text-modality knowledge into a strong vision model (Anticipative Vision Transformer) yields consistent gains across both datasets, 3.5% relative improvement on top1 class mean recall for EGTEA-GAZE+, 7.2% on top5 many-shot class mean recall for EPIC-KITCHEN 55 and achieves new state-of-the-results.

Abstract (translated)

URL

https://arxiv.org/abs/2210.05991

PDF

https://arxiv.org/pdf/2210.05991.pdf


Tags
3D Action Action_Localization Action_Recognition Activity Adversarial Agent Attention Autonomous Bert Boundary_Detection Caption Chat Classification CNN Compressive_Sensing Contour Contrastive_Learning Deep_Learning Denoising Detection Dialog Diffusion Drone Dynamic_Memory_Network Edge_Detection Embedding Embodied Emotion Enhancement Face Face_Detection Face_Recognition Facial_Landmark Few-Shot Gait_Recognition GAN Gaze_Estimation Gesture Gradient_Descent Handwriting Human_Parsing Image_Caption Image_Classification Image_Compression Image_Enhancement Image_Generation Image_Matting Image_Retrieval Inference Inpainting Intelligent_Chip Knowledge Knowledge_Graph Language_Model Matching Medical Memory_Networks Multi_Modal Multi_Task NAS NMT Object_Detection Object_Tracking OCR Ontology Optical_Character Optical_Flow Optimization Person_Re-identification Point_Cloud Portrait_Generation Pose Pose_Estimation Prediction QA Quantitative Quantitative_Finance Quantization Re-identification Recognition Recommendation Reconstruction Regularization Reinforcement_Learning Relation Relation_Extraction Represenation Represenation_Learning Restoration Review RNN Salient Scene_Classification Scene_Generation Scene_Parsing Scene_Text Segmentation Self-Supervised Semantic_Instance_Segmentation Semantic_Segmentation Semi_Global Semi_Supervised Sence_graph Sentiment Sentiment_Classification Sketch SLAM Sparse Speech Speech_Recognition Style_Transfer Summarization Super_Resolution Surveillance Survey Text_Classification Text_Generation Tracking Transfer_Learning Transformer Unsupervised Video_Caption Video_Classification Video_Indexing Video_Prediction Video_Retrieval Visual_Relation VQA Weakly_Supervised Zero-Shot