Paper Reading AI Learner

Active Multimodal Distillation for Few-shot Action Recognition

2025-06-16 10:10:56
Weijia Feng, Yichen Zhu, Ruojia Zhang, Chenyang Wang, Fei Ma, Xiaobao Wang, Xiaobai Li

Abstract

Owing to its rapid progress and broad application prospects, few-shot action recognition has attracted considerable interest. However, current methods predominantly rely on limited single-modal data and do not fully exploit the potential of multimodal information. This paper presents a novel framework that actively identifies reliable modalities for each sample using task-specific contextual cues, thereby significantly improving recognition performance. Our framework integrates an Active Sample Inference (ASI) module, which utilizes active inference to predict reliable modalities based on posterior distributions and then organizes them accordingly. Unlike reinforcement learning, active inference replaces rewards with evidence-based preferences, yielding more stable predictions. Additionally, we introduce an active mutual distillation module that enhances the representation learning of less reliable modalities by transferring knowledge from more reliable ones. During meta-testing, adaptive multimodal inference assigns higher weights to the reliable modalities. Extensive experiments across multiple benchmarks demonstrate that our method significantly outperforms existing approaches.
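
The abstract gives no implementation details, but its two core ideas can be sketched: weighting each sample's modalities by an estimated reliability, and distilling knowledge from the more reliable modality into the less reliable one. Below is a minimal PyTorch sketch under explicit assumptions: the entropy-based reliability proxy stands in for the paper's active-inference posterior, the two modalities are called "rgb" and "flow" purely for illustration, and all function names (modality_reliability, mutual_distillation_loss, fused_prediction) are hypothetical rather than the paper's ASI formulation.

```python
# Hypothetical sketch of reliability-weighted fusion and mutual distillation.
# Not the paper's implementation; names and the entropy proxy are assumptions.

import torch
import torch.nn.functional as F


def modality_reliability(logits_rgb: torch.Tensor,
                         logits_flow: torch.Tensor) -> torch.Tensor:
    """Estimate per-sample reliability weights for two modalities.

    Reliability is approximated here by the negative entropy of each
    modality's class posterior (lower entropy -> more confident -> more
    reliable). The paper's ASI module uses active inference instead; this
    proxy is only for illustration.
    """
    def neg_entropy(logits):
        p = logits.softmax(dim=-1)
        return (p * p.clamp_min(1e-8).log()).sum(dim=-1)  # shape: (batch,)

    scores = torch.stack([neg_entropy(logits_rgb),
                          neg_entropy(logits_flow)], dim=-1)
    return scores.softmax(dim=-1)  # (batch, 2) reliability weights


def mutual_distillation_loss(logits_rgb, logits_flow, weights, tau=2.0):
    """Distill from the more reliable modality into the less reliable one.

    For each sample, the modality with the larger reliability weight acts as
    the teacher (its logits are detached) and the other as the student.
    """
    t_rgb = (logits_rgb / tau).log_softmax(dim=-1)
    t_flow = (logits_flow / tau).log_softmax(dim=-1)

    # KL(teacher || student) in both directions, gated per sample by which
    # modality is currently the more reliable one.
    rgb_teaches = (weights[:, 0] > weights[:, 1]).float()
    kl_flow_from_rgb = F.kl_div(t_flow, t_rgb.detach(),
                                reduction='none', log_target=True).sum(-1)
    kl_rgb_from_flow = F.kl_div(t_rgb, t_flow.detach(),
                                reduction='none', log_target=True).sum(-1)
    loss = rgb_teaches * kl_flow_from_rgb + (1 - rgb_teaches) * kl_rgb_from_flow
    return (tau ** 2) * loss.mean()


def fused_prediction(logits_rgb, logits_flow, weights):
    """Meta-test fusion: reliability-weighted average of modality posteriors."""
    probs = torch.stack([logits_rgb.softmax(-1), logits_flow.softmax(-1)], dim=-1)
    return (probs * weights.unsqueeze(1)).sum(dim=-1)


if __name__ == "__main__":
    torch.manual_seed(0)
    logits_rgb = torch.randn(4, 5)   # 4 query samples in a 5-way episode
    logits_flow = torch.randn(4, 5)
    w = modality_reliability(logits_rgb, logits_flow)
    print("reliability weights:\n", w)
    print("distillation loss:", mutual_distillation_loss(logits_rgb, logits_flow, w).item())
    print("fused prediction:\n", fused_prediction(logits_rgb, logits_flow, w))
```

The per-sample gating mirrors the abstract's claim that the teacher role is assigned adaptively rather than fixed to one modality; how the paper actually selects the teacher (via active inference over posteriors) is not specified in the abstract.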


URL

https://arxiv.org/abs/2506.13322

PDF

https://arxiv.org/pdf/2506.13322.pdf

