A Survey on Visual Mamba

2024-04-24 16:23:34
Hanwei Zhang, Ying Zhu, Dan Wang, Lijun Zhang, Tianxiang Chen, Zi Ye

Abstract

State space models (SSMs) with selection mechanisms and hardware-aware architectures, namely Mamba, have recently demonstrated significant promise in long-sequence modeling. Because the self-attention mechanism in transformers has complexity quadratic in image size, and thus rapidly growing computational demands, researchers are now exploring how to adapt Mamba for computer vision tasks. This paper is the first comprehensive survey aiming to provide an in-depth analysis of Mamba models in the field of computer vision. It begins by exploring the foundational concepts contributing to Mamba's success, including the state space model framework, selection mechanisms, and hardware-aware design. Next, we review vision Mamba models by categorizing them into foundational models and models enhanced with techniques such as convolution, recurrence, and attention. We further delve into the widespread applications of Mamba in vision tasks, including its use as a backbone at various levels of vision processing. These applications encompass general visual tasks, medical visual tasks (e.g., 2D/3D segmentation, classification, and image registration), and remote sensing visual tasks. We present general visual tasks at two levels: high/mid-level vision (e.g., object detection, segmentation, and video classification) and low-level vision (e.g., image super-resolution, image restoration, and visual generation). We hope this endeavor will spark additional interest within the community to address current challenges and further apply Mamba models in computer vision.
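
As a rough illustration of the scaling argument in the abstract, the sketch below counts tokens for a square image and compares the number of pairwise interactions in full self-attention with the number of sequential state updates in a recurrent SSM scan. The patch size of 16 and the helper names (num_tokens, attention_pairs, scan_steps) are assumptions made for this example and do not come from the paper; the counts ignore constants, the hidden dimension, and any hardware effects.

```python
def num_tokens(height: int, width: int, patch: int) -> int:
    """Count non-overlapping patch tokens (assumes sides divisible by patch)."""
    return (height // patch) * (width // patch)


def attention_pairs(n: int) -> int:
    """Token-to-token interactions in full self-attention: grows as n^2."""
    return n * n


def scan_steps(n: int) -> int:
    """State updates in a sequential SSM scan: one per token, grows as n."""
    return n


if __name__ == "__main__":
    # Hypothetical setting: square images, 16x16 patches.
    for side in (224, 448, 896):
        n = num_tokens(side, side, patch=16)
        print(f"side={side} tokens={n} "
              f"attention_pairs={attention_pairs(n)} scan_steps={scan_steps(n)}")
```

Under these assumptions, doubling the image side quadruples the token count, which multiplies the attention interaction count by 16 but the scan step count only by 4.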

URL

https://arxiv.org/abs/2404.15956

PDF

https://arxiv.org/pdf/2404.15956.pdf

