Paper Reading AI Learner

A Multimodal Intermediate Fusion Network with Manifold Learning for Stress Detection

2024-03-12 21:06:19
Morteza Bodaghi, Majid Hosseini, Raju Gottumukkala

Abstract

Multimodal deep learning methods capture synergistic features from multiple modalities and have the potential to improve accuracy for stress detection compared to unimodal methods. However, this accuracy gain typically comes at a high computational cost due to high-dimensional feature spaces, especially for intermediate fusion. Dimensionality reduction is one way to optimize multimodal learning: it simplifies the data and makes the features more amenable to processing and analysis, thereby reducing computational complexity. This paper introduces an intermediate multimodal fusion network with manifold learning-based dimensionality reduction. The network generates independent representations from biometric signals and facial landmarks through a 1D-CNN and a 2D-CNN, respectively. These features are then fused and fed to another 1D-CNN layer, followed by a fully connected dense layer. We compared various dimensionality reduction techniques across different unimodal and multimodal network variants. Intermediate-level fusion with the Multi-Dimensional Scaling (MDS) manifold method showed the most promising results, reaching 96.00% accuracy under a Leave-One-Subject-Out Cross-Validation (LOSO-CV) paradigm and outperforming the other dimensionality reduction methods. Although MDS had the highest computational cost among the manifold learning methods, the proposed network still reduced computational cost by 25% compared to six well-known conventional feature selection methods applied in the preprocessing step.
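As a rough illustration of the pipeline the abstract describes, the sketch below concatenates per-modality feature vectors (intermediate, feature-level fusion) and then applies scikit-learn's MDS for manifold-based dimensionality reduction. This is a minimal sketch, not the authors' implementation: the sample counts, feature dimensions, and target dimension are hypothetical, and the random arrays stand in for the representations a 1D-CNN/2D-CNN would produce.

```python
import numpy as np
from sklearn.manifold import MDS

rng = np.random.default_rng(0)

# Hypothetical learned representations for two modalities,
# one row per sample (stand-ins for CNN feature vectors from
# biometric signals and facial landmarks).
bio_feats = rng.normal(size=(40, 64))
face_feats = rng.normal(size=(40, 96))

# Intermediate fusion at the feature level: concatenate per-sample vectors.
fused = np.concatenate([bio_feats, face_feats], axis=1)  # shape (40, 160)

# Manifold-learning-based dimensionality reduction with MDS;
# n_components=8 is an arbitrary choice for illustration.
mds = MDS(n_components=8, random_state=0)
fused_low = mds.fit_transform(fused)

print(fused_low.shape)
```

In the paper's network the reduced features would then feed a further 1D-CNN layer and a dense classifier; here the sketch stops at the reduced representation.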


URL

https://arxiv.org/abs/2403.08077

PDF

https://arxiv.org/pdf/2403.08077.pdf

