Abstract
In this paper, we study a novel problem in egocentric action recognition, which we term "Multimodal Generalization" (MMG). MMG asks how systems can generalize when data from certain modalities is limited or even completely missing. We thoroughly investigate MMG in the context of standard supervised action recognition and the more challenging few-shot setting for learning new action categories. MMG consists of two novel scenarios, designed to support security and efficiency considerations in real-world applications: (1) missing modality generalization, where some modalities that were present at training time are missing at inference time, and (2) cross-modal zero-shot generalization, where the modalities present at inference time and at training time are disjoint. To enable this investigation, we construct a new dataset, MMG-Ego4D, containing data points with video, audio, and inertial motion sensor (IMU) modalities. Our dataset is derived from the Ego4D dataset, but processed and thoroughly re-annotated by human experts to facilitate research on the MMG problem. We evaluate a diverse array of models on MMG-Ego4D and propose new methods with improved generalization ability. In particular, we introduce a new fusion module with modality dropout training, contrastive-based alignment training, and a novel cross-modal prototypical loss for better few-shot performance. We hope this study will serve as a benchmark and guide future research in multimodal generalization problems. The benchmark and code will be available at this https URL.
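The modality dropout training mentioned above can be illustrated with a minimal sketch: during training, entire modality feature vectors are randomly zeroed out so the fused model learns to cope with missing modalities at inference time. The function below is a hypothetical illustration (function name, feature format, and the keep-at-least-one rule are our assumptions), not the paper's actual fusion module.

```python
import random

def modality_dropout(features, p_drop=0.5, training=True, rng=random):
    """Randomly zero out entire modality feature vectors during training.

    `features` maps a modality name (e.g. "video", "audio", "imu") to a
    feature vector (a list of floats). Dropping whole modalities at train
    time encourages the downstream fusion module to tolerate modalities
    that are missing at inference time. Hypothetical sketch only.
    """
    if not training:
        # At inference time, pass features through unchanged.
        return {m: list(vec) for m, vec in features.items()}

    names = list(features)
    # Independently decide which modalities to drop ...
    dropped = {m for m in names if rng.random() < p_drop}
    # ... but always keep at least one modality so the input is non-empty.
    if len(dropped) == len(names):
        dropped.discard(rng.choice(names))

    return {
        m: ([0.0] * len(vec) if m in dropped else list(vec))
        for m, vec in features.items()
    }
```

With `p_drop=1.0` exactly one randomly chosen modality survives each call; with `training=False` the features are returned unchanged, mirroring the usual train/eval split of dropout-style regularizers.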
URL
https://arxiv.org/abs/2305.07214