Paper Reading AI Learner

An Evolutionary Network Architecture Search Framework with Adaptive Multimodal Fusion for Hand Gesture Recognition

2024-03-27 02:39:23
Yizhang Xia, Shihao Song, Zhanglu Hou, Junwen Xu, Juan Zou, Yuan Liu, Shengxiang Yang

Abstract

Hand gesture recognition (HGR) based on multimodal data has attracted considerable attention owing to its great potential in applications. Various manually designed multimodal deep networks have performed well in multimodal HGR (MHGR), but most existing algorithms require substantial expert experience and time-consuming manual trial and error. To address these issues, we propose an evolutionary network architecture search framework with adaptive multimodal fusion (AMF-ENAS). Specifically, we design an encoding space that simultaneously considers the fusion positions and ratios of the multimodal data, allowing multimodal networks with different architectures to be constructed automatically through decoding. Additionally, we consider three input streams corresponding to intra-modal surface electromyography (sEMG), intra-modal accelerometer (ACC), and inter-modal sEMG-ACC data. To adapt automatically to various datasets, the ENAS framework is designed to search for an MHGR network with appropriate fusion positions and ratios. To the best of our knowledge, this is the first time ENAS has been applied to MHGR to address the fusion position and ratio of multimodal data. Experimental results demonstrate that AMF-ENAS achieves state-of-the-art performance on the Ninapro DB2, DB3, and DB7 datasets.
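
The abstract describes the encoding only at a high level, so the following is a minimal sketch of how a genome that jointly encodes a fusion position and per-stream fusion ratios for the three input streams (sEMG, ACC, sEMG-ACC) might be represented and decoded. All names, the fixed block count, and the single-fusion-point decoding rule are illustrative assumptions, not the paper's actual scheme.

import random
from dataclasses import dataclass

NUM_BLOCKS = 6  # assumed backbone depth; the paper does not specify this

@dataclass
class Genome:
    # Block index at which the three streams merge (assumption: a single
    # fusion point; the paper searches over fusion positions).
    fusion_position: int
    # Weights for (sEMG, ACC, sEMG-ACC) at the fusion point, summing to 1.
    fusion_ratios: tuple

def random_genome() -> Genome:
    """Sample a random point in this illustrative encoding space."""
    position = random.randrange(1, NUM_BLOCKS)
    raw = [random.random() for _ in range(3)]
    total = sum(raw)
    return Genome(position, tuple(r / total for r in raw))

def decode(genome: Genome) -> str:
    """Decode a genome into a human-readable network skeleton."""
    lines = []
    for block in range(NUM_BLOCKS):
        if block < genome.fusion_position:
            lines.append(f"block {block}: parallel sEMG / ACC / sEMG-ACC branches")
        elif block == genome.fusion_position:
            ratios = ", ".join(f"{r:.2f}" for r in genome.fusion_ratios)
            lines.append(f"block {block}: weighted fusion with ratios ({ratios})")
        else:
            lines.append(f"block {block}: shared trunk after fusion")
    return "\n".join(lines)

if __name__ == "__main__":
    print(decode(random_genome()))

In an actual evolutionary search, genomes like this would be mutated and recombined, with each decoded network trained and evaluated on the target dataset to supply the fitness signal.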

URL

https://arxiv.org/abs/2403.18208

PDF

https://arxiv.org/pdf/2403.18208.pdf
