Combating Missing Modalities in Egocentric Videos at Test Time

2024-04-23 16:01:33
Merey Ramazanova, Alejandro Pardo, Bernard Ghanem, Motasem Alfarra

Abstract

Understanding videos that contain multiple modalities is crucial, especially in egocentric videos, where combining various sensory inputs significantly improves tasks like action recognition and moment localization. However, real-world applications often face challenges with incomplete modalities due to privacy concerns, efficiency needs, or hardware issues. Current methods, while effective, often necessitate retraining the model entirely to handle missing modalities, making them computationally intensive, particularly with large training datasets. In this study, we propose a novel approach to address this issue at test time without requiring retraining. We frame the problem as a test-time adaptation task, where the model adjusts to the available unlabeled data at test time. Our method, MiDl (Mutual information with self-Distillation), encourages the model to be insensitive to the specific modality source present during testing by minimizing the mutual information between the prediction and the available modality. Additionally, we incorporate self-distillation to maintain the model's original performance when both modalities are available. MiDl represents the first self-supervised, online solution for handling missing modalities exclusively at test time. Through experiments with various pretrained models and datasets, MiDl demonstrates substantial performance improvement without the need for retraining.
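
The abstract names two ingredients: a mutual-information term between the prediction and the available modality, and a self-distillation term. The sketch below is one possible reading of that description in plain NumPy, not the authors' implementation. Everything specific in it is an assumption made for illustration: the function names, the uniform prior over the three input variants (modality A only, modality B only, both), the estimator I(Y; M) = H(E_m[p(y|m)]) - E_m[H(p(y|m))], the KL form of the distillation term, and the unweighted sum of the two terms.

```python
import numpy as np

def softmax(logits, axis=-1):
    # Numerically stable softmax.
    z = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def entropy(p, axis=-1, eps=1e-12):
    # Shannon entropy in nats along the class axis.
    return -(p * np.log(p + eps)).sum(axis=axis)

def mutual_information(probs_per_modality):
    """Estimate I(Y; M) for predictions of shape (M, B, C).

    Assumes a uniform prior over the M input variants (an assumption,
    not something stated in the abstract). Returns the batch mean.
    """
    marginal = probs_per_modality.mean(axis=0)                 # (B, C)
    h_marginal = entropy(marginal)                             # (B,)
    h_conditional = entropy(probs_per_modality).mean(axis=0)   # (B,)
    return float((h_marginal - h_conditional).mean())

def kl_divergence(p, q, eps=1e-12):
    # KL(p || q), averaged over the batch.
    return float((p * (np.log(p + eps) - np.log(q + eps))).sum(axis=-1).mean())

def midl_style_loss(logits_a, logits_b, logits_both, frozen_logits_both):
    """Hypothetical combined objective: MI term plus self-distillation term.

    logits_a / logits_b: adapted model's outputs with only one modality.
    logits_both: adapted model's outputs with both modalities.
    frozen_logits_both: outputs of the original, un-adapted model on the
    same full-modality input (the distillation target).
    """
    probs = np.stack([softmax(logits_a), softmax(logits_b), softmax(logits_both)])
    mi = mutual_information(probs)
    kd = kl_divergence(softmax(frozen_logits_both), softmax(logits_both))
    return mi + kd, mi, kd

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    batch, classes = 4, 5
    frozen = rng.normal(size=(batch, classes))
    total, mi, kd = midl_style_loss(
        rng.normal(size=(batch, classes)),
        rng.normal(size=(batch, classes)),
        frozen + 0.1 * rng.normal(size=(batch, classes)),
        frozen,
    )
    print(f"total={total:.4f}  mi={mi:.4f}  kd={kd:.4f}")
```

Under these assumptions, the mutual-information term goes to zero when the prediction is the same regardless of which modality is present, and the distillation term goes to zero when the adapted model matches the original model on full-modality inputs. In a real test-time adaptation loop the loss would be computed in an autodiff framework and used to update the model online; that part is omitted here.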

URL

https://arxiv.org/abs/2404.15161

PDF

https://arxiv.org/pdf/2404.15161.pdf
