Paper Reading AI Learner

M3ET: Efficient Vision-Language Learning for Robotics based on Multimodal Mamba-Enhanced Transformer

2025-09-22 16:44:34
Yanxin Zhang (School of Software, Northwestern Polytechnical University), Liang He (School of Software, Northwestern Polytechnical University), Zeyi Kang (School of Software, Northwestern Polytechnical University), Zuheng Ming (Laboratoire L2TI, Université Sorbonne Paris Nord), Kaixing Zhao (School of Software, Yangtze River Delta Research Institute)

Abstract

In recent years, multimodal learning has become essential for robotic vision and information fusion, especially for understanding human behavior in complex environments. However, current methods struggle to fully exploit the textual modality, often relying on supervised pretrained models, which limits semantic extraction in unsupervised robotic environments, particularly under significant modality loss. These methods also tend to be computationally intensive, leading to high resource consumption in real-world applications. To address these challenges, we propose the Multimodal Mamba-Enhanced Transformer (M3ET), a lightweight model designed for efficient multimodal learning, particularly on mobile platforms. By incorporating the Mamba module and a semantic-based adaptive attention mechanism, M3ET optimizes feature fusion, alignment, and modality reconstruction. Our experiments show that M3ET improves cross-task performance, with a 2.3× increase in pretraining inference speed. In particular, accuracy on the core VQA task remains at 0.74 while the parameter count is reduced by 0.67. Although performance on the embodied question answering (EQA) task is limited, M3ET's lightweight design makes it well suited for deployment on resource-constrained robotic platforms.
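The two ingredients named in the abstract can be sketched in isolation: a Mamba-style selective state-space scan (whose state transition is modulated by an input-dependent step size), and an attention-based fusion of vision and text token features. This is a minimal illustrative sketch in numpy under assumed tensor shapes, not the authors' implementation; the function names, the simple residual fusion, and all dimensions are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def selective_scan(x, A, B, C, delta):
    """Simplified Mamba-style selective scan (illustrative, not the paper's code).
    x: (T, D) token features; A: (D, N) state matrix (diagonal, negative);
    B, C: (T, N) input-dependent projections; delta: (T, D) step sizes.
    Recurrence per step: h = exp(delta*A) * h + delta * B_t * x_t,  y_t = h @ C_t.
    """
    T, D = x.shape
    N = A.shape[-1]
    h = np.zeros((D, N))                 # hidden state, one N-dim state per channel
    ys = np.empty((T, D))
    for t in range(T):
        dA = np.exp(delta[t][:, None] * A)                     # input-dependent decay
        h = dA * h + delta[t][:, None] * B[t][None, :] * x[t][:, None]
        ys[t] = h @ C[t]                                       # read out the state
    return ys

def fuse(vision_feats, text_feats):
    """Toy adaptive-attention fusion: text tokens attend over vision tokens,
    then the attended vision context is residually mixed with the text features."""
    scores = text_feats @ vision_feats.T / np.sqrt(text_feats.shape[-1])
    attn = softmax(scores, axis=-1)      # (T_text, T_vision) attention weights
    attended = attn @ vision_feats       # vision context per text token
    return (attended + text_feats) / 2   # simple residual fusion
```

In a full model these pieces would replace or complement quadratic self-attention layers; the scan runs in O(T) per channel, which is the source of the efficiency argument for Mamba-style blocks on long token sequences.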

URL

https://arxiv.org/abs/2509.18005

PDF

https://arxiv.org/pdf/2509.18005.pdf

