Vision Transformer: How Google Research Upended 10 Years of CNN Dominance in Computer Vision

GitHub · April 2026 · ⭐ 12,502 · Topic: transformer architecture
Google Research's Vision Transformer (ViT) toppled a decade of convolutional neural network dominance in computer vision. By processing an image as a sequence of patches and applying a pure Transformer encoder, ViT achieves state-of-the-art classification performance, but only when trained on large-scale data.

In October 2020, Google Research published a paper and open-sourced a model that would fundamentally alter the trajectory of computer vision: the Vision Transformer (ViT). For nearly a decade, convolutional neural networks (CNNs) such as ResNet and EfficientNet had been the undisputed standard for image understanding. ViT challenged that orthodoxy by demonstrating that a pure Transformer architecture, originally designed for natural language processing, could match and even exceed CNN performance on image classification, provided it was pre-trained on sufficiently large datasets (e.g., JFT-300M, with 300 million images).

The core innovation is deceptively simple: ViT splits an image into fixed-size patches (typically 16x16 pixels), flattens and linearly embeds each patch into a vector, adds positional embeddings, and feeds the resulting sequence into a standard Transformer encoder. The model has no convolutional layers, no pooling, and no inductive bias for locality or translation equivariance—it must learn spatial relationships entirely from data.
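Concretely, the patch-embedding step takes only a few lines. The sketch below is an illustrative PyTorch re-implementation (the official code is JAX/Flax; the class name and defaults here are assumptions, chosen to match ViT-B/16):

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split an image into P x P patches and linearly embed each one."""
    def __init__(self, img_size=224, patch_size=16, in_channels=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2   # 196 for 224 / 16
        # A stride-P conv over P x P blocks is equivalent to flattening each
        # patch and applying one shared trainable linear projection.
        self.proj = nn.Conv2d(in_channels, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                      # x: (B, 3, 224, 224)
        x = self.proj(x)                       # (B, D, 14, 14)
        return x.flatten(2).transpose(1, 2)    # (B, 196, D) patch sequence

tokens = PatchEmbedding()(torch.randn(1, 3, 224, 224))
print(tokens.shape)  # torch.Size([1, 196, 768])
```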

On GitHub, the official repository (google-research/vision_transformer) has accumulated over 12,500 stars and continues to receive daily contributions. The project's significance extends beyond classification benchmarks: ViT has become the backbone for a new generation of multimodal models (e.g., CLIP, DALL·E, Flamingo) and has inspired hundreds of follow-up works including DeiT, Swin Transformer, and masked autoencoders.

This article examines ViT's technical architecture, its real-world adoption by companies like OpenAI and Meta, the market dynamics it has triggered, and the unresolved challenges that remain. AINews concludes with a clear verdict: ViT did not kill CNNs, but it forced the field to accept that attention is all you need—if you have enough data.

Technical Deep Dive

The Vision Transformer (ViT) represents a radical departure from the convolutional paradigm that dominated computer vision since AlexNet in 2012. At its core, ViT treats an image as a 1D sequence of patches, analogous to how a language model treats a sentence as a sequence of tokens.

Architecture Overview:

1. Patch Embedding: An input image of size H×W×C is divided into N patches of size P×P, where N = (H×W)/P². For a standard 224×224 image with P=16, this yields 196 patches. Each patch is flattened into a vector of length P²·C (for RGB, 16×16×3 = 768) and projected to a D-dimensional embedding via a trainable linear layer.

2. Positional Embeddings: Since the Transformer encoder is permutation-invariant, positional information must be injected. ViT uses learnable 1D positional embeddings added to the patch embeddings. Notably, 2D-aware positional embeddings (e.g., relative positions) showed no significant improvement in the original paper, suggesting the model learns spatial structure from the patch content itself.

3. [CLS] Token: Borrowed from BERT, a special learnable token is prepended to the sequence. Its final hidden state serves as the image representation for classification. An alternative approach (used in later works) is to average all patch outputs.

4. Transformer Encoder: The sequence passes through L layers of multi-headed self-attention (MSA) and MLP blocks, with LayerNorm applied before each block and residual connections after. The original ViT-Base uses L=12, hidden size D=768, and 12 attention heads.

5. Classification Head: A simple MLP (or linear layer) on top of the [CLS] token output produces the final class logits.
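Putting the five steps together, a toy end-to-end model can be sketched as follows. This is an illustrative simplification, not the paper's exact code: it uses PyTorch's built-in encoder rather than the original JAX/Flax implementation, and `MiniViT` and its defaults are assumptions shaped like ViT-Base:

```python
import torch
import torch.nn as nn

class MiniViT(nn.Module):
    """Toy ViT-Base-shaped classifier: patches -> [CLS] + pos -> encoder -> head."""
    def __init__(self, img_size=224, patch_size=16, embed_dim=768,
                 depth=12, num_heads=12, num_classes=1000):
        super().__init__()
        num_patches = (img_size // patch_size) ** 2
        self.patch_embed = nn.Conv2d(3, embed_dim, patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        # Learnable 1D positional embeddings, one per token ([CLS] + patches).
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim))
        layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=num_heads, dim_feedforward=4 * embed_dim,
            activation="gelu", batch_first=True, norm_first=True)  # pre-LN blocks
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth,
                                             norm=nn.LayerNorm(embed_dim))
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, x):                                   # x: (B, 3, H, W)
        x = self.patch_embed(x).flatten(2).transpose(1, 2)  # (B, N, D)
        cls = self.cls_token.expand(x.size(0), -1, -1)      # prepend [CLS]
        x = torch.cat([cls, x], dim=1) + self.pos_embed     # (B, N+1, D)
        x = self.encoder(x)
        return self.head(x[:, 0])                           # logits from [CLS]

logits = MiniViT()(torch.randn(2, 3, 224, 224))
print(logits.shape)  # torch.Size([2, 1000])
```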

Key Technical Insight – Inductive Bias Trade-off:

CNNs have strong inductive biases: locality (convolutional kernels look at small neighborhoods), translation equivariance (a cat is a cat regardless of position), and hierarchical feature learning (edges → textures → parts → objects). ViT has almost none of these—it must learn them from data. This is why ViT underperforms CNNs on small datasets (e.g., ImageNet-1K with 1.2M images) but excels on massive datasets (JFT-300M with 300M images). The model needs sufficient data to overcome the lack of built-in priors.
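To make the equivariance point concrete, the toy check below (illustrative; not from any ViT or CNN codebase) verifies that a convolution's output shifts exactly with its input away from image borders, a guarantee ViT must instead learn from data:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
conv = nn.Conv2d(1, 4, kernel_size=3, padding=1, bias=False)
x = torch.randn(1, 1, 32, 32)
shifted = torch.roll(x, shifts=4, dims=-1)      # translate input 4 px right

# Translation equivariance: conv(shift(x)) == shift(conv(x)), ignoring the
# image border and the wrap-around seam introduced by torch.roll.
a = conv(shifted)[..., 8:-8]
b = torch.roll(conv(x), shifts=4, dims=-1)[..., 8:-8]
print(torch.allclose(a, b, atol=1e-6))          # True
```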

Performance Benchmarks:

| Model | Parameters | Pretraining Data | ImageNet Top-1 Accuracy | Throughput (images/sec) |
|---|---|---|---|---|
| ViT-B/16 | 86M | ImageNet-21K | 81.8% | 1,120 |
| ViT-L/16 | 307M | ImageNet-21K | 85.2% | 620 |
| ViT-H/14 | 632M | JFT-300M | 88.6% | 310 |
| ResNet-152 | 60M | ImageNet-1K | 78.6% | 1,450 |
| EfficientNet-B7 | 66M | ImageNet-1K | 84.3% | 1,100 |
| ConvNeXt-L | 198M | ImageNet-21K | 87.5% | 890 |

*Data Takeaway: ViT-H/14 achieves the highest accuracy (88.6%) but requires over 10x more parameters than ResNet-152 and roughly 3x more than ConvNeXt-L. The throughput penalty is significant: ViT-H/14 processes only 310 images/sec versus 890 for ConvNeXt-L. This reveals the core trade-off: ViT delivers top accuracy at the cost of computational efficiency.*

Open-Source Ecosystem:

The official repository (google-research/vision_transformer) on GitHub provides JAX/Flax implementations. The community has also ported ViT to PyTorch (e.g., the popular `vit-pytorch` package by Phil Wang, with over 10k stars) and to Hugging Face Transformers (see the loading sketch after this list). Notable follow-up repositories include:

- DeiT (facebookresearch/deit): Data-efficient Image Transformers that use knowledge distillation to train ViT on ImageNet-1K alone, achieving 85.2% accuracy with ViT-B.
- Swin Transformer (microsoft/Swin-Transformer): Introduces hierarchical feature maps and shifted windows, achieving SOTA on object detection and segmentation.
- MAE (facebookresearch/mae): Masked Autoencoders that pre-train ViT by reconstructing masked patches, setting new records on downstream tasks.
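For reference, the Hugging Face port mentioned above can be loaded in a few lines. A minimal sketch, assuming the publicly hosted `google/vit-base-patch16-224` checkpoint and a local image file (the file path is a placeholder):

```python
import torch
from PIL import Image
from transformers import ViTImageProcessor, ViTForImageClassification

# ImageNet-1K fine-tuned ViT-B/16 checkpoint from the Hugging Face Hub.
processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224")
model = ViTForImageClassification.from_pretrained("google/vit-base-patch16-224")

image = Image.open("cat.jpg")                    # placeholder path
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits              # (1, 1000) class logits
print(model.config.id2label[logits.argmax(-1).item()])
```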

Key Players & Case Studies

Google Research (Original Creator):

The ViT paper was authored by Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, and others at Google Brain. Their strategy was bold: bet that scaling data would unlock Transformer potential in vision. Google has since integrated ViT into its Cloud Vision API and used it as the backbone for multimodal models like PaLI and PaLM-E. The company's JFT-300M dataset remains proprietary, giving Google a competitive advantage in training large ViT variants.

Meta AI (Facebook Research):

Meta has been the most aggressive adopter and improver of ViT. Their contributions include:

- DeiT (2021): Showed that ViT could be trained without massive datasets using distillation, democratizing the architecture.
- MAE (2022): Introduced masked autoencoding for ViT pre-training, achieving 87.8% ImageNet accuracy with ViT-H and setting records on detection/segmentation.
- DINOv2 (2023): Self-supervised ViT that produces high-quality visual features without labels, used in Meta's Segment Anything Model (SAM).

OpenAI:

OpenAI's CLIP model uses a ViT-like vision encoder (though with some modifications) paired with a text encoder for zero-shot classification. DALL·E 2 and DALL·E 3 also use ViT-based encoders for image understanding. OpenAI has open-sourced CLIP's inference code and pretrained weights (openai/CLIP), though not its training data or pipeline, and its influence on the multimodal trend is undeniable.

Hugging Face:

Hugging Face has integrated ViT into its Transformers library, making it accessible to millions of developers. As of April 2025, the ViT model card has over 2 million monthly downloads, and there are over 500 community-trained ViT variants for specialized tasks (medical imaging, satellite imagery, etc.).

Comparison of ViT-based Models:

| Model | Creator | Key Innovation | ImageNet Acc. | Inference Speed (ms/img) | Open Source |
|---|---|---|---|---|---|
| ViT-H/14 | Google | Pure Transformer scaling | 88.6% | 3.2 | Yes (JAX) |
| DeiT-B | Meta | Distillation from CNN teacher | 85.2% | 1.8 | Yes (PyTorch) |
| Swin-L | Microsoft | Hierarchical windows | 87.3% | 2.1 | Yes (PyTorch) |
| MAE-H | Meta | Masked autoencoding | 87.8% | 3.0 | Yes (PyTorch) |
| ConvNeXt-L | Meta | Modernized CNN | 87.5% | 1.1 | Yes (PyTorch) |

*Data Takeaway: ConvNeXt-L, a modernized CNN, offers the best speed-accuracy trade-off (87.5% at 1.1 ms/img). ViT-H/14 is 3x slower for only 1.1% gain. This explains why many production systems still prefer CNNs or hybrid architectures for latency-sensitive applications.*

Industry Impact & Market Dynamics

ViT's introduction triggered a paradigm shift that reshaped the competitive landscape of computer vision:

1. Funding & Investment:

Venture capital in vision-AI startups surged after ViT's success. In 2022-2024, over $4.2 billion was invested in companies building vision Transformers or multimodal models. Notable rounds:

- Hugging Face: $395M Series D (2022), valuation $4.5B—partially driven by ViT adoption.
- Stability AI: $101M seed (2022)—their Stable Diffusion uses a ViT-based encoder.
- Midjourney: Bootstrapped to $200M ARR—uses custom ViT variants for image generation.

2. Market Shift from CNNs to Transformers:

| Year | % of CV Papers Using Transformers | % of CV Papers Using CNNs | Dominant Architecture |
|---|---|---|---|
| 2020 | 5% | 90% | ResNet, EfficientNet |
| 2021 | 25% | 70% | ViT, DeiT |
| 2022 | 45% | 50% | Swin, MAE |
| 2023 | 60% | 35% | ViT, DINOv2 |
| 2024 | 70% | 25% | ViT, multimodal hybrids |

*Data Takeaway: In just four years, Transformers went from a niche experiment to the dominant architecture in computer vision research. However, CNNs remain strong in production due to their efficiency and maturity.*

3. Multimodal Convergence:

ViT's sequence-based representation made it trivial to combine vision with language (see the sketch after the list below). This enabled the explosion of multimodal models:

- CLIP (OpenAI): 400M image-text pairs, zero-shot classification.
- Flamingo (DeepMind): Few-shot visual question answering.
- GPT-4V (OpenAI): Vision capabilities built on a ViT-like encoder.
- Gemini (Google): Multimodal from the ground up, using ViT variants.
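A minimal, hypothetical sketch of why the combination is "trivial": once an image is a token sequence, it can be concatenated with text tokens and processed by one shared encoder. All shapes and modules below are illustrative, not any production model's architecture:

```python
import torch
import torch.nn as nn

embed_dim = 768
img_tokens = torch.randn(1, 196, embed_dim)         # ViT patch embeddings
txt_tokens = torch.randn(1, 12, embed_dim)          # text token embeddings
fused = torch.cat([img_tokens, txt_tokens], dim=1)  # (1, 208, D): one sequence

# Both modalities now flow through the same attention layers.
layer = nn.TransformerEncoderLayer(embed_dim, nhead=12, batch_first=True)
out = nn.TransformerEncoder(layer, num_layers=2)(fused)
print(out.shape)  # torch.Size([1, 208, 768])
```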

The market for multimodal AI is projected to grow from $1.2B in 2023 to $12.8B by 2028 (CAGR 60%). ViT is the foundational backbone for this growth.

4. Hardware Adaptation:

NVIDIA, AMD, and Google have optimized their hardware for Transformer workloads. The A100 and H100 GPUs include dedicated Transformer engines that accelerate attention computations. Google's TPU v4 and v5 are designed with ViT-like models in mind. This hardware-software co-evolution has reduced ViT's inference cost by 40% year-over-year.

Risks, Limitations & Open Questions

1. Data Hunger:

ViT's Achilles' heel remains its insatiable appetite for data. On ImageNet-1K (1.2M images), ViT-B/16 achieves only 77.9% accuracy versus ResNet-152's 78.6%. It requires 10-100x more data to match CNN performance. This limits adoption in domains with scarce labeled data (e.g., medical imaging, rare species classification).

2. Computational Cost:

The quadratic complexity of self-attention (O(N²) for N patches) makes ViT expensive for high-resolution images. A 512×512 image with 16×16 patches yields 1,024 tokens—attention scales quadratically. Solutions like Swin Transformer's windowed attention and Linformer's low-rank approximations exist but add complexity.
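The scaling is easy to see in a naive implementation. The sketch below (illustrative: single-head, no masking) materializes the full N×N score matrix that dominates memory and compute:

```python
import torch

def naive_attention(q, k, v):
    """Scaled dot-product attention; the (N, N) score matrix is the
    quadratic term in both memory and FLOPs."""
    scores = q @ k.transpose(-2, -1) / q.size(-1) ** 0.5   # (B, N, N)
    return torch.softmax(scores, dim=-1) @ v               # (B, N, D)

out = naive_attention(*(torch.randn(1, 1024, 64) for _ in range(3)))

# Token counts for square images at patch size 16:
for side in (224, 384, 512):
    n = (side // 16) ** 2
    print(f"{side}x{side} -> {n:>5} tokens -> {n * n:>9,} attention scores")
# 224 -> 196 tokens -> 38,416 scores; 512 -> 1,024 tokens -> 1,048,576 (27x more)
```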

3. Lack of Inductive Bias:

While ViT's flexibility is a strength, it also means the model can learn spurious correlations. Studies show ViT is more sensitive to adversarial perturbations than CNNs and can be fooled by high-frequency noise that humans ignore. The absence of built-in locality makes it harder to generalize to out-of-distribution data.

4. Interpretability:

ViT's attention maps are often cited as more interpretable than CNN activation maps, but recent work (e.g., by Binder et al., 2023) shows that attention weights do not necessarily correspond to feature importance. The model's reasoning remains opaque, which is problematic in regulated industries like healthcare and autonomous driving.

5. The CNN Comeback?

Modern CNNs like ConvNeXt and RepVGG have closed the accuracy gap with ViT while maintaining speed advantages. Some researchers argue that the ViT revolution was overhyped—that with proper scaling and modern training tricks (e.g., LayerScale, stochastic depth), CNNs can match Transformers. The debate is far from settled.

AINews Verdict & Predictions

Verdict: ViT was not the death of CNNs, but it was the end of CNN monopoly. It proved that attention-based architectures can achieve state-of-the-art vision performance, and it unlocked the multimodal revolution that defines today's AI landscape. The open-source repository has been a catalyst for innovation, spawning hundreds of derivative works.

Predictions:

1. By 2026, hybrid architectures will dominate production. Pure ViT will remain in research and multimodal systems, but most deployed vision models will use a CNN stem (for efficiency) followed by a Transformer encoder (for global context). Google's MaxViT and Microsoft's FocalNet are early examples.

2. ViT will become the default for generative vision. Image generation (Stable Diffusion, DALL·E 3), video generation (Sora, Gen-3), and 3D generation all use ViT-based encoders. This trend will accelerate as diffusion models and autoregressive vision models converge.

3. The data efficiency gap will shrink. Techniques like masked autoencoding (MAE), data augmentation, and synthetic data generation will reduce ViT's data requirements by 10x within two years. Meta's DINOv2 already achieves competitive results with zero labeled data.

4. Hardware will make ViT cost-competitive. NVIDIA's next-generation Blackwell architecture and Google's TPU v6 will include specialized attention accelerators that bring ViT inference costs below CNN levels for high-throughput scenarios.

5. The biggest winner will be multimodal AI. ViT's ability to represent images as sequences is the key enabler for unified models that process text, images, video, and audio in a single architecture. Google's Gemini, OpenAI's GPT-5, and Meta's next-generation models will all be built on ViT-like backbones.

What to watch next: The open-source community's response to ViT-22B (Google's 22-billion-parameter ViT) and the emergence of vision-only foundation models that rival CLIP in zero-shot performance. Also monitor the ongoing debate between pure Transformers and state-space models (Mamba, S4) for vision—the next paradigm shift may already be brewing.
