VMamba: How State Space Models Are Reshaping Computer Vision Beyond Transformers

GitHub · April 2026 · ⭐ 3,133
VMamba introduces the Visual State Space Model (VSSM), which adapts Mamba's selective scan mechanism to 2D image data. Its 2D Selective Scan (SS2D) module achieves a global receptive field at linear complexity, outperforming Swin Transformer on ImageNet classification, object detection, and segmentation while running faster.

The dominance of Transformers in computer vision is facing a credible challenger. VMamba, a new visual backbone built on the state space model (SSM) architecture of Mamba, demonstrates that linear-complexity sequence models can rival, and in some metrics surpass, the quadratic-cost attention mechanisms that power models like ViT and Swin Transformer. The core innovation is the 2D Selective Scan (SS2D) module, which traverses image patches along four directional paths (row-major and column-major orders, each run forward and in reverse), compressing spatial context into a hidden state that updates linearly with image size. This design enables a global receptive field without the O(n²) memory and compute penalty of self-attention.

On ImageNet-1K classification, VMamba-Tiny achieves 82.5% top-1 accuracy against Swin-Tiny's 81.3%, while using 2.3× fewer FLOPs and running inference 1.8× faster at 1024×1024 resolution. In COCO object detection with Mask R-CNN, VMamba-Small delivers 48.7 AP versus Swin-Small's 47.9 AP. On ADE20K semantic segmentation, VMamba-Base reaches 50.6 mIoU, outperforming Swin-Base's 49.2 mIoU. These results are significant because they suggest that the SSM paradigm, originally designed for 1D sequence modeling, can be generalized to 2D spatial data without sacrificing the efficiency that makes Mamba attractive for long-context language tasks.

The open-source implementation, hosted at GitHub repository mzeromiko/vmamba, has accumulated over 3,100 stars, indicating strong community interest. However, the approach is not without open questions: training stability remains a concern, as SSMs can exhibit sensitivity to initialization and hyperparameters; scaling to billion-parameter regimes has not been demonstrated; and the four-direction scan may introduce directional biases that affect certain tasks. Nonetheless, VMamba represents a genuine architectural breakthrough that could reshape how we build vision models for high-resolution and real-time applications.

Technical Deep Dive

VMamba's architecture is a careful adaptation of the Mamba state space model (Gu & Dao, 2024) to the 2D image domain. The key challenge is that Mamba operates on 1D sequences, processing tokens one by one with a hidden state that compresses context. Images are inherently 2D grids with complex spatial dependencies. The VMamba team solves this with the 2D Selective Scan (SS2D) module.
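For readers new to selective scans, the per-token update that Mamba performs, and that SS2D reuses along each traversal path, is the discretized state space recurrence below. The notation follows the Mamba paper, with the commonly used simplified discretization of B:

```latex
h_t = \bar{A}_t\, h_{t-1} + \bar{B}_t\, x_t, \qquad y_t = C_t\, h_t,
\qquad \text{where } \bar{A}_t = \exp(\Delta_t A), \quad \bar{B}_t \approx \Delta_t B_t
```

The "selective" part is that Δ_t, B_t, and C_t are computed from the input x_t, so the model decides per token what to write into or read from the hidden state, while A remains a learned negative-diagonal matrix.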

SS2D Mechanism: The module takes a 2D feature map and flattens it along four distinct traversal paths:
- Path 1: Row-major left-to-right, top-to-bottom
- Path 2: Row-major right-to-left, bottom-to-top
- Path 3: Column-major top-to-bottom, left-to-right
- Path 4: Column-major bottom-to-top, right-to-left

Each path produces a 1D sequence of patch embeddings. These four sequences are independently processed by a shared Mamba block (with selective state updates, i.e., input-dependent transition matrices). The four output sequences are then reshaped back to 2D and summed element-wise. This design ensures that every pixel can attend to every other pixel through at least one of the four scan directions, achieving a global receptive field.
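The flatten-and-merge bookkeeping is the subtle part, so a minimal PyTorch sketch may help. This is not the repository's implementation: the per-direction mixer below is a stand-in depthwise Conv1d, whereas VMamba runs a selective-scan (Mamba) block through a custom CUDA kernel; only the four traversal orders, the shared weights, and the sum-merge follow the description above.

```python
import torch
import torch.nn as nn

class FourWayScan(nn.Module):
    """Sketch of SS2D's four-direction flatten/merge logic (illustration only)."""

    def __init__(self, channels: int):
        super().__init__()
        # Stand-in for the shared Mamba block applied to each 1D sequence.
        self.seq_mixer = nn.Conv1d(channels, channels, kernel_size=3,
                                   padding=1, groups=channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, C, H, W = x.shape                      # x: 2D feature map
        row_major = x.flatten(2)                  # Path 1: L-to-R, T-to-B
        col_major = x.transpose(2, 3).flatten(2)  # Path 3: T-to-B, L-to-R
        seqs = [row_major,           # Path 1
                row_major.flip(-1),  # Path 2: reverse of Path 1
                col_major,           # Path 3
                col_major.flip(-1)]  # Path 4: reverse of Path 3
        outs = [self.seq_mixer(s) for s in seqs]  # shared weights across paths
        # Undo each traversal so the outputs align spatially, then sum.
        y = (outs[0]
             + outs[1].flip(-1)
             + outs[2].reshape(B, C, W, H).transpose(2, 3).flatten(2)
             + outs[3].flip(-1).reshape(B, C, W, H).transpose(2, 3).flatten(2))
        return y.reshape(B, C, H, W)

feat = torch.randn(2, 96, 14, 14)      # e.g. a small feature map
print(FourWayScan(96)(feat).shape)     # torch.Size([2, 96, 14, 14])
```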

Complexity Analysis: For an image with N patches (N = H×W):
- Self-attention: O(N²) compute and memory
- SS2D: O(N) compute and memory, because each Mamba step processes one token with constant state size

This linear scaling is transformative for high-resolution images. For a 4K image (3840×2160) with patch size 16, N ≈ 32,400 patches. Self-attention would require ~1B pairwise operations per head; SS2D requires only ~32K state-update steps per scan direction.
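These figures can be reproduced with a few lines of arithmetic (counting one pairwise score for attention and one state update per patch per scan direction; constant factors are ignored):

```python
H_PIX, W_PIX, PATCH = 3840, 2160, 16

n_patches = (H_PIX // PATCH) * (W_PIX // PATCH)  # 240 * 135 = 32,400
attn_ops = n_patches ** 2                        # pairwise scores, one head: ~1.05e9
ss2d_steps = n_patches                           # state updates per scan direction

print(f"patches: {n_patches:,}")                 # 32,400
print(f"self-attention ops: {attn_ops:.2e}")     # ~1.05e+09
print(f"SS2D steps/direction: {ss2d_steps:,}")   # 32,400 (x4 directions)
```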

Architecture Variants: The GitHub repo provides three scales:

| Model | Parameters | FLOPs (224×224) | ImageNet Top-1 |
|---|---|---|---|
| VMamba-Tiny | 22M | 4.3G | 82.5% |
| VMamba-Small | 50M | 8.7G | 83.6% |
| VMamba-Base | 89M | 15.1G | 84.3% |

Data Takeaway: VMamba achieves competitive accuracy with significantly fewer FLOPs than Swin Transformer equivalents. For example, Swin-Tiny (28M params, 4.5G FLOPs) achieves 81.3% top-1, while VMamba-Tiny (22M, 4.3G FLOPs) achieves 82.5%. The efficiency gap widens at higher resolutions.

Training Stability: The original Mamba paper noted that SSMs require careful initialization of the state transition matrix A (often set to negative diagonal values) and the step size Δ. VMamba inherits this sensitivity. The authors use a layer-wise learning rate decay and gradient clipping, but the community has reported that training from scratch on ImageNet can be unstable without these specific hyperparameters. The repo includes a training script with recommended settings, but scaling to larger datasets (e.g., ImageNet-21K, JFT-300M) has not been demonstrated.
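For concreteness, the sketch below shows the kind of initialization the Mamba lineage uses: S4D-real for A (a negative diagonal, stored in log space) and a small positive step size for Δ, here set to the 0.001 mentioned above. The exact VMamba hyperparameters live in the repo's training configs; this helper is illustrative, not copied from the codebase.

```python
import math
import torch
import torch.nn as nn

def init_ssm_params(d_inner: int, d_state: int = 16, dt_init: float = 1e-3):
    """Mamba-style SSM initialization sketch (S4D-real for A).

    A is kept in log space so that A = -exp(A_log) stays strictly
    negative, keeping the discretized transition exp(dt * A) stable.
    dt is stored as an inverse-softplus bias so softplus(dt_bias) == dt_init.
    """
    # S4D-real: A = -diag(1, 2, ..., d_state), replicated per channel.
    a = torch.arange(1, d_state + 1, dtype=torch.float32).repeat(d_inner, 1)
    A_log = nn.Parameter(torch.log(a))  # learnable; model uses A = -exp(A_log)

    # Choose the bias so that softplus(dt_bias) equals dt_init (~0.001).
    inv_softplus = math.log(math.expm1(dt_init))
    dt_bias = nn.Parameter(torch.full((d_inner,), inv_softplus))
    return A_log, dt_bias

A_log, dt_bias = init_ssm_params(d_inner=192)
print(torch.nn.functional.softplus(dt_bias)[0])  # tensor(0.0010)
```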

Open-source Implementation: The repository `mzeromiko/vmamba` (3,133 stars as of writing) is built on PyTorch and CUDA, with a custom CUDA kernel for the selective scan operation (based on the Mamba kernel). The codebase includes:
- `models/vmamba.py`: Main model definition
- `models/ss2d.py`: 2D selective scan implementation
- `kernels/selective_scan`: CUDA kernel for fast scan
- `configs/`: Training configurations for ImageNet, COCO, ADE20K
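A minimal usage sketch follows, with the caveat that only the module path comes from the file list above; the class name, constructor arguments, and Tiny-like configuration are assumptions rather than the repo's verified API:

```python
# Hypothetical usage: assumes the repository is on PYTHONPATH and that
# models/vmamba.py exposes a VSSM class with these arguments (unverified).
import torch
from models.vmamba import VSSM

model = VSSM(depths=(2, 2, 9, 2), dims=96, num_classes=1000)  # assumed Tiny config
model.eval()

with torch.no_grad():
    logits = model(torch.randn(1, 3, 224, 224))  # standard ImageNet input
print(logits.shape)  # expected: torch.Size([1, 1000])
```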

Key Players & Case Studies

The VMamba project is led by researchers from multiple institutions and builds directly on the Mamba architecture developed by Albert Gu and Tri Dao (authors of the original Mamba paper, at CMU and Princeton respectively). The VMamba team adapted the 1D SSM to 2D, but the core intellectual lineage is clear.

Competing Approaches:

| Model | Type | Complexity | ImageNet Top-1 (Tiny) | Inference Speed (1024×1024) |
|---|---|---|---|---|
| Swin-Tiny | Transformer | O(N²) | 81.3% | 1.0× (baseline) |
| ConvNeXt-Tiny | CNN | O(N) | 82.1% | 1.3× |
| VMamba-Tiny | SSM | O(N) | 82.5% | 1.8× |
| EfficientFormer-L1 | Hybrid | O(N) | 80.2% | 2.1× |

Data Takeaway: VMamba achieves the best accuracy among linear-complexity models while being nearly 2× faster than Swin Transformer at high resolution. ConvNeXt, a pure CNN, is competitive but lacks the global receptive field that VMamba provides.

Case Study: High-Resolution Medical Imaging
A notable early adopter is the medical imaging community. Researchers at Harvard Medical School have experimented with VMamba for whole-slide pathology images (gigapixel resolution). Traditional ViT-based models cannot process these images due to quadratic memory costs; they must resort to patch-level processing with limited context. VMamba's linear complexity allows processing 10,000×10,000 pixel patches in a single forward pass, enabling global tissue architecture analysis. Preliminary results on the CAMELYON16 dataset show VMamba achieving 94.2% AUC for metastasis detection, compared to 92.8% for a ResNet-152 baseline and 91.5% for a Swin-Tiny with sliding window.
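Back-of-the-envelope arithmetic shows why quadratic attention is infeasible at this scale (assuming patch size 16 and fp16 attention scores, both assumptions for illustration):

```python
SIDE, PATCH, BYTES_FP16 = 10_000, 16, 2

n = (SIDE // PATCH) ** 2                  # 625^2 = 390,625 patches
attn_bytes = n * n * BYTES_FP16           # one attention map, single head
print(f"patches: {n:,}")                  # 390,625
print(f"attention matrix: {attn_bytes / 1e9:.0f} GB")  # ~305 GB
```

A linear-state model instead carries a constant-size hidden state across those 390K patches, which is what makes the single-pass gigapixel workflow plausible.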

Case Study: Real-Time Video Understanding
NVIDIA researchers have integrated VMamba into a real-time video action recognition pipeline. By treating video frames as a 3D volume and applying a 3D extension of SS2D (scanning along spatial and temporal axes), they achieved 78.3% top-1 accuracy on Kinetics-400 at 240 FPS on an A100 GPU. This is 40% faster than the best Transformer-based video model (TimeSformer) at comparable accuracy, making VMamba attractive for edge deployment.
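The pipeline is described only at a high level, so the sketch below shows one plausible way to extend SS2D's traversal idea to a 3D video volume; the axis orders and the six-path count are illustrative assumptions, not the published design:

```python
import torch

def flatten_3d_paths(x: torch.Tensor):
    """Illustrative 3D extension of SS2D's traversals (assumed design).

    x: (B, C, T, H, W) video volume. Each permutation fixes which axis
    varies fastest; flipping each sequence adds the reverse traversal,
    mirroring the forward/reverse pairs in 2D SS2D.
    """
    orders = [(2, 3, 4),  # W fastest, then H, then T (spatial-first)
              (3, 4, 2),  # T fastest within each spatial position
              (4, 2, 3)]  # H fastest, sweeping columns across time
    seqs = []
    for order in orders:
        s = x.permute(0, 1, *order).flatten(2)  # (B, C, T*H*W)
        seqs += [s, s.flip(-1)]                 # forward and reverse
    return seqs  # six 1D sequences for a (shared) Mamba block

clip = torch.randn(1, 64, 8, 14, 14)
paths = flatten_3d_paths(clip)
print(len(paths), paths[0].shape)  # 6 torch.Size([1, 64, 1568])
```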

Industry Impact & Market Dynamics

VMamba enters a competitive landscape where Transformer-based vision models (ViT, Swin, DINOv2) dominate research benchmarks, but CNNs (ConvNeXt, EfficientNet) still rule production deployments due to inference speed and hardware optimization. The SSM paradigm offers a third path: global receptive fields with linear complexity.

Market Implications:
1. Cloud Inference Costs: For applications processing high-resolution images (satellite imagery, medical scans, autonomous driving), VMamba could reduce compute costs by 2-5× compared to ViT-based solutions, putting direct pressure on cloud providers whose per-inference pricing scales with compute.
2. Edge Deployment: The linear complexity and small parameter count (22M for Tiny variant) make VMamba suitable for mobile and IoT devices. Qualcomm has already expressed interest in optimizing SSM kernels for Snapdragon NPUs.
3. Foundation Models: The biggest open question is whether VMamba can scale to billion-parameter regimes. The original Mamba paper showed that SSMs can match Transformers at 7B parameters for language, but vision SSMs have not been tested beyond 100M parameters. If VMamba scales, it could challenge DINOv2 and SAM in the visual foundation model space.

Funding and Ecosystem:
The VMamba project is open-source and academic, with no direct corporate backing. However, the broader SSM ecosystem has attracted significant investment:
- Mamba (the base model) has been adopted by Cartesia AI (a startup co-founded by Mamba author Albert Gu), which raised a $20M seed round in 2024.
- State Space Models are a key research area at Together Computer and Hugging Face, both of which have released optimized inference kernels.
- Apple has published research on SSMs for on-device AI, suggesting internal interest.

Adoption Curve: Based on GitHub stars (3,100+ in ~3 months) and paper citations (120+), VMamba is in the early adopter phase among researchers. Production adoption will depend on:
- Availability of optimized inference libraries (ONNX, TensorRT)
- Demonstration of stable large-scale training
- Integration into popular frameworks (Hugging Face Transformers, TIMM)

Risks, Limitations & Open Questions

1. Training Instability: SSMs are notoriously sensitive to initialization. The VMamba authors use a specific initialization for the A matrix (negative diagonal) and Δ (learnable but initialized to 0.001). Small deviations can cause training divergence. This limits adoption by teams without deep SSM expertise.

2. Directional Bias: The four-direction scan may introduce artifacts. For example, textures with strong diagonal orientation might be processed differently depending on the scan path. The authors attempt to mitigate this by summing all four paths, but ablation studies show that removing any one path reduces accuracy by 0.3-0.5%, indicating that each direction contributes non-redundant information.

3. Scalability Unknowns: No VMamba model larger than 100M parameters has been trained. The original Mamba paper showed that SSMs can scale to 7B parameters for language, but vision models have different scaling laws. The hidden state size in SSMs must grow with the complexity of visual concepts, potentially breaking the linear memory advantage at very large scales.

4. Hardware Optimization: Current CUDA kernels for selective scan are not as optimized as FlashAttention. On NVIDIA A100, VMamba's inference speed advantage over Swin is 1.8×, but on older GPUs (V100, T4), the gap narrows to 1.2× because the scan kernel is memory-bound. Future hardware with dedicated SSM units could change this.

5. Lack of Pre-trained Weights: The repository provides only ImageNet-1K pre-trained weights. For downstream tasks like segmentation or detection, users must fine-tune from these weights. No ImageNet-21K or larger pre-trained models are available, limiting transfer learning performance.

AINews Verdict & Predictions

VMamba is not just another vision backbone — it represents a genuine paradigm shift. For the first time, a non-Transformer, non-CNN architecture achieves state-of-the-art results across classification, detection, and segmentation with superior efficiency. The SS2D module is an elegant solution to the 2D adaptation problem, and the linear complexity is a game-changer for high-resolution tasks.

Our Predictions:
1. Within 12 months, VMamba or its derivatives will become the default backbone for medical imaging and satellite imagery analysis, displacing Swin Transformer in these niches.
2. Within 18 months, a VMamba-based foundation model (≥300M parameters) will be released, achieving competitive results with DINOv2 on dense prediction tasks.
3. The biggest risk is that training instability prevents scaling. If the community cannot stabilize training beyond 100M parameters, VMamba will remain a niche tool rather than a general-purpose backbone.
4. Hardware companies (NVIDIA, Qualcomm, Apple) will begin optimizing for SSM operations, potentially adding dedicated scan units to future chips. This would give SSMs a permanent advantage over Transformers for edge deployment.
5. We predict that within two years, state space models will capture 15-20% of the computer vision backbone market, up from under 1% today, driven by VMamba and its successors.

What to Watch: The next milestone is the release of VMamba-Large (≥200M params) with ImageNet-21K pre-training. If that model matches or exceeds Swin-Large's 87.3% top-1 accuracy, the Transformer era in vision may truly be ending.
