Paper Reading AI Learner

MMG-Ego4D: Multi-Modal Generalization in Egocentric Action Recognition

2023-05-12 03:05:40
Xinyu Gong, Sreyas Mohan, Naina Dhingra, Jean-Charles Bazin, Yilei Li, Zhangyang Wang, Rakesh Ranjan

Abstract

In this paper, we study a novel problem in egocentric action recognition, which we term "Multimodal Generalization" (MMG). MMG aims to study how systems can generalize when data from certain modalities is limited or even completely missing. We thoroughly investigate MMG in the context of standard supervised action recognition and the more challenging few-shot setting for learning new action categories. MMG consists of two novel scenarios, designed to support security and efficiency considerations in real-world applications: (1) missing-modality generalization, where some modalities present at training time are missing at inference time, and (2) cross-modal zero-shot generalization, where the modalities present at inference time and at training time are disjoint. To enable this investigation, we construct a new dataset, MMG-Ego4D, containing data points with video, audio, and inertial motion sensor (IMU) modalities. Our dataset is derived from the Ego4D dataset, but processed and thoroughly re-annotated by human experts to facilitate research on the MMG problem. We evaluate a diverse array of models on MMG-Ego4D and propose new methods with improved generalization ability. In particular, we introduce a new fusion module with modality dropout training, contrastive-based alignment training, and a novel cross-modal prototypical loss for better few-shot performance. We hope this study will serve as a benchmark and guide future research in multimodal generalization problems. The benchmark and code will be available at this https URL.
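The modality dropout training mentioned in the abstract can be illustrated with a short sketch. The code below is a minimal, hypothetical illustration (names and details are ours, not the authors' implementation): during training, each modality's features are randomly zeroed out with probability `p_drop`, while always keeping at least one modality, so a fusion module learns to cope with modalities that are missing at inference time.

```python
import random

def modality_dropout(features, p_drop=0.5, rng=None):
    """Randomly drop whole modalities, always keeping at least one.

    `features` maps a modality name (e.g. "video", "audio", "imu") to its
    feature vector; dropped modalities are replaced with a zero vector of
    the same length, mimicking a missing input at inference time.
    """
    rng = rng or random.Random()
    names = list(features)
    kept = [n for n in names if rng.random() >= p_drop]
    if not kept:  # never drop every modality at once
        kept = [rng.choice(names)]
    return {n: v if n in kept else [0.0] * len(v)
            for n, v in features.items()}

# Example: with p_drop=1.0 exactly one randomly chosen modality survives;
# with p_drop=0.0 all modalities pass through unchanged.
feats = {"video": [1.0, 2.0], "audio": [3.0], "imu": [4.0, 5.0]}
dropped = modality_dropout(feats, p_drop=1.0)
```

At inference time the same network can then be fed whichever modalities are actually present, with the rest zeroed out, which is the missing-modality scenario the benchmark evaluates.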

Abstract (translated)

In this paper, we study a novel problem in egocentric action recognition, which we call "Multimodal Generalization" (MMG). The goal of MMG is to study how systems can generalize when data from certain modalities (e.g., visual, auditory, and inertial motion sensors) is limited or completely missing. We investigate MMG in depth in the context of standard supervised action recognition, as well as in the more difficult few-shot setting for learning new action categories. MMG consists of two new scenarios designed to support security and efficiency considerations in real-world applications: (1) missing-modality generalization, where some modalities are missing at inference time, and (2) cross-modal zero-shot generalization, where the modalities at inference time and at training time are disjoint. To conduct this study, we created MMG-Ego4D, a dataset containing video, audio, and inertial motion sensor (IMU) modalities. Our dataset is drawn from the Ego4D dataset, but processed and thoroughly re-annotated by human experts to aid research on the MMG problem. We evaluated a variety of models on MMG-Ego4D and proposed new methods that improve generalization ability. In particular, we introduce a new fusion module with modality dropout training, contrastive-based alignment training, and a novel cross-modal prototypical loss to improve few-shot performance. We hope this study will serve as a benchmark for multimodal generalization problems and guide future research. The benchmark and code will be available at this https URL.

URL

https://arxiv.org/abs/2305.07214

PDF

https://arxiv.org/pdf/2305.07214.pdf
