X-MIC: Cross-Modal Instance Conditioning for Egocentric Action Generalization

2024-03-28 19:45:35
Anna Kukleva, Fadime Sener, Edoardo Remelli, Bugra Tekin, Eric Sauser, Bernt Schiele, Shugao Ma

Abstract

Lately, there has been growing interest in adapting vision-language models (VLMs) to image and third-person video classification due to their success in zero-shot recognition. However, the adaptation of these models to egocentric videos remains largely unexplored. To address this gap, we propose a simple yet effective cross-modal adaptation framework, which we call X-MIC. Using a video adapter, our pipeline learns to align frozen text embeddings to each egocentric video directly in the shared embedding space. Our novel adapter architecture retains and improves the generalization of the pre-trained VLM by disentangling the learnable temporal modeling from the frozen visual encoder. This results in an enhanced alignment of text embeddings to each egocentric video, leading to a significant improvement in cross-dataset generalization. We evaluate our approach on the Epic-Kitchens, Ego4D, and EGTEA datasets for fine-grained cross-dataset action generalization, demonstrating the effectiveness of our method. Code is available at this https URL.
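
To make the abstract's mechanism concrete, here is a minimal PyTorch sketch of the idea it describes: a frozen visual encoder produces per-frame features, a small learnable adapter handles temporal modeling separately, and the resulting per-video vector conditions the frozen text embeddings before cosine-similarity classification. This is not the authors' implementation (see the linked code for that); the module names, the additive form of the conditioning, and all dimensions below are assumptions for illustration.

```python
# Hypothetical sketch of cross-modal instance conditioning (not the X-MIC code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class VideoAdapter(nn.Module):
    """Learnable temporal modeling, kept separate from the frozen visual encoder."""
    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.temporal = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, batch_first=True)
        self.proj = nn.Linear(dim, dim)

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        # frame_feats: (B, T, D) per-frame features from a frozen encoder.
        x = self.temporal(frame_feats)      # mix information across time
        return self.proj(x.mean(dim=1))     # (B, D): one vector per video

def conditioned_logits(frame_feats, text_embeds, adapter, scale=100.0):
    """Condition frozen text embeddings on each video, then classify.

    frame_feats: (B, T, D) frozen per-frame visual features (e.g. CLIP ViT)
    text_embeds: (C, D) frozen class-name embeddings from the text encoder
    """
    video_vec = adapter(frame_feats)                                  # (B, D)
    # Assumed conditioning: shift every class embedding toward the video.
    conditioned = text_embeds.unsqueeze(0) + video_vec.unsqueeze(1)   # (B, C, D)
    conditioned = F.normalize(conditioned, dim=-1)
    video = F.normalize(frame_feats.mean(dim=1), dim=-1)              # (B, D)
    return scale * torch.einsum("bd,bcd->bc", video, conditioned)     # (B, C)

adapter = VideoAdapter(dim=512)
logits = conditioned_logits(torch.randn(2, 8, 512), torch.randn(10, 512), adapter)
print(logits.shape)  # torch.Size([2, 10])
```

Because only the adapter is trained while both encoders stay frozen, the pre-trained image-text alignment is preserved, which is the property the abstract credits for the cross-dataset generalization gains.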

URL

https://arxiv.org/abs/2403.19811

PDF

https://arxiv.org/pdf/2403.19811.pdf

