Journal of Frontiers of Computer Science and Technology (计算机科学与探索), 2026, Vol. 20, Issue 1: 66-78 (13 pages). DOI: 10.3778/j.issn.1673-9418.2503061
Visual Mamba: Structure, Practice, and Prospects
Abstract
Traditional convolutional neural networks (CNNs) struggle to model global features because of their limited receptive field. Vision Transformers (ViTs) offer the advantages of sequence modeling, but their quadratic computational complexity poses severe computational challenges for image processing. In response, researchers have begun exploring new architectures that combine efficient computation with global perception. The visual Mamba model, based on state space models (SSMs), enables global context modeling with linear computational complexity while retaining sequence modeling capabilities, marking a new stage in vision modeling based on state space models. This paper describes the basic framework of the visual Mamba block, a dual residual structure composed of residual modules, 2D selective scan (SS2D) modules, and feed-forward networks (FFNs), and analyzes the working mechanisms of cross-scanning, S6 block processing, and cross-fusion within the SS2D module. The visual Mamba model is then examined from three aspects: scanning methods, stacking methods, and hybrid architectures. Scanning methods include sequential scanning and dynamic scanning, and the advantages and disadvantages of the different scanning strategies are compared. Stacking methods are categorized into serial Mamba, parallel Mamba, U-shaped Mamba, and graph Mamba, with a detailed analysis of the network construction logic of each stacking structure and its adaptability to multi-scale feature extraction and long-range dependency modeling. Hybrid architectures cover fusion with CNNs, Transformers, and attention mechanisms, including single-module fusion and multi-module collaborative architectures, along with an analysis of the strengths and weaknesses of each model. The analysis indicates that the visual Mamba model overcomes both the local perception limitation of CNNs and the quadratic computational complexity of Transformers. It outperforms mainstream backbone architectures in visual tasks and shows strong potential to become a fundamental visual backbone.
Keywords
visual Mamba / scanning method / stacking method / hybrid architecture
Classification
Information Technology and Security Science
Cite this article
张鑫, 智敏, 萨茹拉, 阿日木扎. Visual Mamba: Structure, Practice, and Prospects [J]. Journal of Frontiers of Computer Science and Technology, 2026, 20(1): 66-78 (13 pages).
Funding
This work was supported by the Natural Science Foundation of Inner Mongolia (2023MS06009) and the Basic and Applied Basic Research Project of Hohhot (2024-Gui-Ji-22).
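Illustrative sketch
The abstract names three steps inside the SS2D module: cross-scanning, S6 block processing, and cross-fusion. The NumPy sketch below is one possible reading of that pipeline, not the authors' implementation. The function names (`cross_scan`, `toy_scan`, `cross_merge`, `ss2d_sketch`), the choice of four scan directions, the summation used for fusion, and the `decay` parameter are all assumptions made for illustration. In particular, `toy_scan` is a fixed linear recurrence standing in for the S6 block; it is not input-dependent (selective) as a real S6 block would be.

```python
import numpy as np


def cross_scan(x):
    """Unfold an (H, W, C) feature map into four (H*W, C) sequences.

    Assumed directions: row-major, column-major, and the reverse of each.
    """
    H, W, C = x.shape
    row_major = x.reshape(H * W, C)
    col_major = x.transpose(1, 0, 2).reshape(H * W, C)
    return [row_major, col_major, row_major[::-1], col_major[::-1]]


def toy_scan(seq, decay=0.5):
    """Hypothetical stand-in for the S6 block: h_t = decay * h_{t-1} + x_t.

    One pass over the sequence, so the cost grows linearly with its length.
    """
    seq = np.asarray(seq, dtype=float)
    out = np.empty_like(seq)
    h = np.zeros(seq.shape[1])
    for t in range(seq.shape[0]):
        h = decay * h + seq[t]
        out[t] = h
    return out


def cross_merge(seqs, H, W):
    """Restore each sequence to (H, W, C) order and sum the four results."""
    row_f, col_f, row_b, col_b = seqs
    C = row_f.shape[1]
    out = row_f.reshape(H, W, C)
    out = out + col_f.reshape(W, H, C).transpose(1, 0, 2)
    out = out + row_b[::-1].reshape(H, W, C)
    out = out + col_b[::-1].reshape(W, H, C).transpose(1, 0, 2)
    return out


def ss2d_sketch(x, decay=0.5):
    """Cross-scan, process each sequence, then fuse back into a 2D map."""
    H, W, _ = x.shape
    return cross_merge([toy_scan(s, decay) for s in cross_scan(x)], H, W)
```

Because each of the four sequences is processed in a single pass and the reverse scans cover the positions that the forward scans cannot reach, every output location can receive information from every input location while the total work stays proportional to the number of pixels. This is the sense in which the abstract contrasts linear complexity with the quadratic cost of Transformers.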