Vision Transformer: How Google Research Upended 10 Years of CNN Dominance in Computer Vision

GitHub · April 2026 · ⭐ 12,502 · Topic: transformer architecture
Google Research's Vision Transformer (ViT) toppled a decade of convolutional neural network dominance in computer vision. By processing an image as a sequence of patches and applying a pure Transformer encoder, ViT achieves state-of-the-art classification performance, but only when trained on large-scale data.

In October 2020, Google Research published a paper and open-sourced a model that would fundamentally alter the trajectory of computer vision: the Vision Transformer (ViT). For nearly a decade, convolutional neural networks (CNNs) such as ResNet and EfficientNet had been the undisputed standard for image understanding. ViT challenged that orthodoxy by demonstrating that a pure Transformer architecture, originally designed for natural language processing, could match and even exceed CNN performance on image classification, provided it was pre-trained on sufficiently large datasets (e.g., JFT-300M, with 300 million images).

The core innovation is deceptively simple: ViT splits an image into fixed-size patches (typically 16x16 pixels), flattens and linearly embeds each patch into a vector, adds positional embeddings, and feeds the resulting sequence into a standard Transformer encoder. The model has no convolutional layers, no pooling, and no inductive bias for locality or translation equivariance—it must learn spatial relationships entirely from data.
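Concretely, the patch-embedding step takes only a few lines. The sketch below is an illustrative PyTorch re-implementation (the official code is JAX/Flax; the class name and defaults here are assumptions, chosen to match ViT-B/16):

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split an image into P x P patches and linearly embed each one."""
    def __init__(self, img_size=224, patch_size=16, in_channels=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2   # 196 for 224 / 16
        # A stride-P conv over P x P blocks is equivalent to flattening each
        # patch and applying one shared trainable linear projection.
        self.proj = nn.Conv2d(in_channels, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                      # x: (B, 3, 224, 224)
        x = self.proj(x)                       # (B, D, 14, 14)
        return x.flatten(2).transpose(1, 2)    # (B, 196, D) patch sequence

tokens = PatchEmbedding()(torch.randn(1, 3, 224, 224))
print(tokens.shape)  # torch.Size([1, 196, 768])
```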

On GitHub, the official repository (google-research/vision_transformer) has accumulated over 12,500 stars and continues to receive daily contributions. The project's significance extends beyond classification benchmarks: ViT has become the backbone for a new generation of multimodal models (e.g., CLIP, DALL·E, Flamingo) and has inspired hundreds of follow-up works including DeiT, Swin Transformer, and masked autoencoders.

This article examines ViT's technical architecture, its real-world adoption by companies like OpenAI and Meta, the market dynamics it has triggered, and the unresolved challenges that remain. AINews concludes with a clear verdict: ViT did not kill CNNs, but it forced the field to accept that attention is all you need—if you have enough data.

Technical Deep Dive

The Vision Transformer (ViT) represents a radical departure from the convolutional paradigm that dominated computer vision since AlexNet in 2012. At its core, ViT treats an image as a 1D sequence of patches, analogous to how a language model treats a sentence as a sequence of tokens.

Architecture Overview:

1. Patch Embedding: An input image of size H×W×C is divided into N patches of size P×P, where N = (H×W)/P². For a standard 224×224 image with P=16, this yields 196 patches. Each patch is flattened into a vector of length P²·C (for RGB, 16×16×3 = 768) and projected to a D-dimensional embedding via a trainable linear layer.

2. Positional Embeddings: Since the Transformer encoder is permutation-invariant, positional information must be injected. ViT uses learnable 1D positional embeddings added to the patch embeddings. Notably, 2D-aware positional embeddings (e.g., relative positions) showed no significant improvement in the original paper, suggesting the model learns spatial structure from the patch content itself.

3. [CLS] Token: Borrowed from BERT, a special learnable token is prepended to the sequence. Its final hidden state serves as the image representation for classification. An alternative approach (used in later works) is to average all patch outputs.

4. Transformer Encoder: The sequence passes through L layers of multi-headed self-attention (MSA) and MLP blocks, with LayerNorm applied before each block and residual connections after. The original ViT-Base uses L=12, hidden size D=768, and 12 attention heads.

5. Classification Head: A simple MLP (or linear layer) on top of the [CLS] token output produces the final class logits.
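Putting the five steps together, a toy end-to-end model can be sketched as follows. This is an illustrative simplification, not the paper's exact code: it uses PyTorch's built-in encoder rather than the original JAX/Flax implementation, and `MiniViT` and its defaults are assumptions shaped like ViT-Base:

```python
import torch
import torch.nn as nn

class MiniViT(nn.Module):
    """Toy ViT-Base-shaped classifier: patches -> [CLS] + pos -> encoder -> head."""
    def __init__(self, img_size=224, patch_size=16, embed_dim=768,
                 depth=12, num_heads=12, num_classes=1000):
        super().__init__()
        num_patches = (img_size // patch_size) ** 2
        self.patch_embed = nn.Conv2d(3, embed_dim, patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        # Learnable 1D positional embeddings, one per token ([CLS] + patches).
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim))
        layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=num_heads, dim_feedforward=4 * embed_dim,
            activation="gelu", batch_first=True, norm_first=True)  # pre-LN blocks
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth,
                                             norm=nn.LayerNorm(embed_dim))
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, x):                                   # x: (B, 3, H, W)
        x = self.patch_embed(x).flatten(2).transpose(1, 2)  # (B, N, D)
        cls = self.cls_token.expand(x.size(0), -1, -1)      # prepend [CLS]
        x = torch.cat([cls, x], dim=1) + self.pos_embed     # (B, N+1, D)
        x = self.encoder(x)
        return self.head(x[:, 0])                           # logits from [CLS]

logits = MiniViT()(torch.randn(2, 3, 224, 224))
print(logits.shape)  # torch.Size([2, 1000])
```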

Key Technical Insight – Inductive Bias Trade-off:

CNNs have strong inductive biases: locality (convolutional kernels look at small neighborhoods), translation equivariance (a cat is a cat regardless of position), and hierarchical feature learning (edges → textures → parts → objects). ViT has almost none of these—it must learn them from data. This is why ViT underperforms CNNs on small datasets (e.g., ImageNet-1K with 1.2M images) but excels on massive datasets (JFT-300M with 300M images). The model needs sufficient data to overcome the lack of built-in priors.
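To make the equivariance point concrete, the toy check below (illustrative; not from any ViT or CNN codebase) verifies that a convolution's output shifts exactly with its input away from image borders, a guarantee ViT must instead learn from data:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
conv = nn.Conv2d(1, 4, kernel_size=3, padding=1, bias=False)
x = torch.randn(1, 1, 32, 32)
shifted = torch.roll(x, shifts=4, dims=-1)      # translate input 4 px right

# Translation equivariance: conv(shift(x)) == shift(conv(x)), ignoring the
# image border and the wrap-around seam introduced by torch.roll.
a = conv(shifted)[..., 8:-8]
b = torch.roll(conv(x), shifts=4, dims=-1)[..., 8:-8]
print(torch.allclose(a, b, atol=1e-6))          # True
```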

Performance Benchmarks:

| Model | Parameters | Pretraining Data | ImageNet Top-1 Accuracy | Throughput (images/sec) |
|---|---|---|---|---|
| ViT-B/16 | 86M | ImageNet-21K | 81.8% | 1,120 |
| ViT-L/16 | 307M | ImageNet-21K | 85.2% | 620 |
| ViT-H/14 | 632M | JFT-300M | 88.6% | 310 |
| ResNet-152 | 60M | ImageNet-1K | 78.6% | 1,450 |
| EfficientNet-B7 | 66M | ImageNet-1K | 84.3% | 1,100 |
| ConvNeXt-L | 198M | ImageNet-21K | 87.5% | 890 |

*Data Takeaway: ViT-H/14 achieves the highest accuracy (88.6%) but requires over 10x more parameters than ResNet-152 and roughly 3x more than ConvNeXt-L. The throughput penalty is significant: ViT-H/14 processes only 310 images/sec versus 890 for ConvNeXt-L. This reveals the core trade-off: ViT delivers top accuracy at the cost of computational efficiency.*

Open-Source Ecosystem:

The official repository (google-research/vision_transformer) on GitHub provides JAX/Flax implementations. The community has also ported ViT to PyTorch (e.g., the popular `vit-pytorch` package by Phil Wang, with over 10k stars) and to Hugging Face Transformers (see the loading sketch after this list). Notable follow-up repositories include:

- DeiT (facebookresearch/deit): Data-efficient Image Transformers that use knowledge distillation to train ViT on ImageNet-1K alone, achieving 85.2% accuracy with ViT-B.
- Swin Transformer (microsoft/Swin-Transformer): Introduces hierarchical feature maps and shifted windows, achieving SOTA on object detection and segmentation.
- MAE (facebookresearch/mae): Masked Autoencoders that pre-train ViT by reconstructing masked patches, setting new records on downstream tasks.
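For reference, the Hugging Face port mentioned above can be loaded in a few lines. A minimal sketch, assuming the publicly hosted `google/vit-base-patch16-224` checkpoint and a local image file (the file path is a placeholder):

```python
import torch
from PIL import Image
from transformers import ViTImageProcessor, ViTForImageClassification

# ImageNet-1K fine-tuned ViT-B/16 checkpoint from the Hugging Face Hub.
processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224")
model = ViTForImageClassification.from_pretrained("google/vit-base-patch16-224")

image = Image.open("cat.jpg")                    # placeholder path
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits              # (1, 1000) class logits
print(model.config.id2label[logits.argmax(-1).item()])
```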

Key Players & Case Studies

Google Research (Original Creator):

The ViT paper was authored by Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, and others at Google Brain. Their strategy was bold: bet that scaling data would unlock Transformer potential in vision. Google has since integrated ViT into its Cloud Vision API and used it as the backbone for multimodal models like PaLI and PaLM-E. The company's JFT-300M dataset remains proprietary, giving Google a competitive advantage in training large ViT variants.

Meta AI (Facebook Research):

Meta has been the most aggressive adopter and improver of ViT. Their contributions include:

- DeiT (2021): Showed that ViT could be trained without massive datasets using distillation, democratizing the architecture.
- MAE (2022): Introduced masked autoencoding for ViT pre-training, achieving 87.8% ImageNet accuracy with ViT-H and setting records on detection/segmentation.
- DINOv2 (2023): Self-supervised ViT that produces high-quality visual features without labels, used in Meta's Segment Anything Model (SAM).

OpenAI:

OpenAI's CLIP model uses a ViT-like vision encoder (though with some modifications) paired with a text encoder for zero-shot classification. DALL·E 2 and DALL·E 3 also use ViT-based encoders for image understanding. OpenAI has open-sourced CLIP's inference code and pretrained weights (openai/CLIP), though not its training data or pipeline, and its influence on the multimodal trend is undeniable.

Hugging Face:

Hugging Face has integrated ViT into its Transformers library, making it accessible to millions of developers. As of April 2025, the ViT model card has over 2 million monthly downloads, and there are over 500 community-trained ViT variants for specialized tasks (medical imaging, satellite imagery, etc.).

Comparison of ViT-based Models:

| Model | Creator | Key Innovation | ImageNet Acc. | Inference Speed (ms/img) | Open Source |
|---|---|---|---|---|---|
| ViT-H/14 | Google | Pure Transformer scaling | 88.6% | 3.2 | Yes (JAX) |
| DeiT-B | Meta | Distillation from CNN teacher | 85.2% | 1.8 | Yes (PyTorch) |
| Swin-L | Microsoft | Hierarchical windows | 87.3% | 2.1 | Yes (PyTorch) |
| MAE-H | Meta | Masked autoencoding | 87.8% | 3.0 | Yes (PyTorch) |
| ConvNeXt-L | Meta | Modernized CNN | 87.5% | 1.1 | Yes (PyTorch) |

*Data Takeaway: ConvNeXt-L, a modernized CNN, offers the best speed-accuracy trade-off (87.5% at 1.1 ms/img). ViT-H/14 is 3x slower for only 1.1% gain. This explains why many production systems still prefer CNNs or hybrid architectures for latency-sensitive applications.*

Industry Impact & Market Dynamics

ViT's introduction triggered a paradigm shift that reshaped the competitive landscape of computer vision:

1. Funding & Investment:

Venture capital in vision-AI startups surged after ViT's success. In 2022-2024, over $4.2 billion was invested in companies building vision Transformers or multimodal models. Notable rounds:

- Hugging Face: $395M Series D (2022), valuation $4.5B—partially driven by ViT adoption.
- Stability AI: $101M seed (2022)—their Stable Diffusion uses a ViT-based encoder.
- Midjourney: Bootstrapped to $200M ARR—uses custom ViT variants for image generation.

2. Market Shift from CNNs to Transformers:

| Year | % of CV Papers Using Transformers | % of CV Papers Using CNNs | Dominant Architecture |
|---|---|---|---|
| 2020 | 5% | 90% | ResNet, EfficientNet |
| 2021 | 25% | 70% | ViT, DeiT |
| 2022 | 45% | 50% | Swin, MAE |
| 2023 | 60% | 35% | ViT, DINOv2 |
| 2024 | 70% | 25% | ViT, multimodal hybrids |

*Data Takeaway: In just four years, Transformers went from a niche experiment to the dominant architecture in computer vision research. However, CNNs remain strong in production due to their efficiency and maturity.*

3. Multimodal Convergence:

ViT's sequence-based representation made it trivial to combine vision with language (see the sketch after the list below). This enabled the explosion of multimodal models:

- CLIP (OpenAI): 400M image-text pairs, zero-shot classification.
- Flamingo (DeepMind): Few-shot visual question answering.
- GPT-4V (OpenAI): Vision capabilities built on a ViT-like encoder.
- Gemini (Google): Multimodal from the ground up, using ViT variants.
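A minimal, hypothetical sketch of why the combination is "trivial": once an image is a token sequence, it can be concatenated with text tokens and processed by one shared encoder. All shapes and modules below are illustrative, not any production model's architecture:

```python
import torch
import torch.nn as nn

embed_dim = 768
img_tokens = torch.randn(1, 196, embed_dim)         # ViT patch embeddings
txt_tokens = torch.randn(1, 12, embed_dim)          # text token embeddings
fused = torch.cat([img_tokens, txt_tokens], dim=1)  # (1, 208, D): one sequence

# Both modalities now flow through the same attention layers.
layer = nn.TransformerEncoderLayer(embed_dim, nhead=12, batch_first=True)
out = nn.TransformerEncoder(layer, num_layers=2)(fused)
print(out.shape)  # torch.Size([1, 208, 768])
```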

The market for multimodal AI is projected to grow from $1.2B in 2023 to $12.8B by 2028 (CAGR 60%). ViT is the foundational backbone for this growth.

4. Hardware Adaptation:

NVIDIA, AMD, and Google have optimized their hardware for Transformer workloads. The A100 and H100 GPUs include dedicated Transformer engines that accelerate attention computations. Google's TPU v4 and v5 are designed with ViT-like models in mind. This hardware-software co-evolution has reduced ViT's inference cost by 40% year-over-year.

Risks, Limitations & Open Questions

1. Data Hunger:

ViT's Achilles' heel remains its insatiable appetite for data. On ImageNet-1K (1.2M images), ViT-B/16 achieves only 77.9% accuracy versus ResNet-152's 78.6%. It requires 10-100x more data to match CNN performance. This limits adoption in domains with scarce labeled data (e.g., medical imaging, rare species classification).

2. Computational Cost:

The quadratic complexity of self-attention (O(N²) for N patches) makes ViT expensive for high-resolution images. A 512×512 image with 16×16 patches yields 1,024 tokens—attention scales quadratically. Solutions like Swin Transformer's windowed attention and Linformer's low-rank approximations exist but add complexity.
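The scaling is easy to see in a naive implementation. The sketch below (illustrative: single-head, no masking) materializes the full N×N score matrix that dominates memory and compute:

```python
import torch

def naive_attention(q, k, v):
    """Scaled dot-product attention; the (N, N) score matrix is the
    quadratic term in both memory and FLOPs."""
    scores = q @ k.transpose(-2, -1) / q.size(-1) ** 0.5   # (B, N, N)
    return torch.softmax(scores, dim=-1) @ v               # (B, N, D)

out = naive_attention(*(torch.randn(1, 1024, 64) for _ in range(3)))

# Token counts for square images at patch size 16:
for side in (224, 384, 512):
    n = (side // 16) ** 2
    print(f"{side}x{side} -> {n:>5} tokens -> {n * n:>9,} attention scores")
# 224 -> 196 tokens -> 38,416 scores; 512 -> 1,024 tokens -> 1,048,576 (27x more)
```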

3. Lack of Inductive Bias:

While ViT's flexibility is a strength, it also means the model can learn spurious correlations. Studies show ViT is more sensitive to adversarial perturbations than CNNs and can be fooled by high-frequency noise that humans ignore. The absence of built-in locality makes it harder to generalize to out-of-distribution data.

4. Interpretability:

ViT's attention maps are often cited as more interpretable than CNN activation maps, but recent work (e.g., by Binder et al., 2023) shows that attention weights do not necessarily correspond to feature importance. The model's reasoning remains opaque, which is problematic in regulated industries like healthcare and autonomous driving.

5. The CNN Comeback?

Modern CNNs like ConvNeXt and RepVGG have closed the accuracy gap with ViT while maintaining speed advantages. Some researchers argue that the ViT revolution was overhyped—that with proper scaling and modern training tricks (e.g., LayerScale, stochastic depth), CNNs can match Transformers. The debate is far from settled.

AINews Verdict & Predictions

Verdict: ViT was not the death of CNNs, but it was the end of CNN monopoly. It proved that attention-based architectures can achieve state-of-the-art vision performance, and it unlocked the multimodal revolution that defines today's AI landscape. The open-source repository has been a catalyst for innovation, spawning hundreds of derivative works.

Predictions:

1. By 2026, hybrid architectures will dominate production. Pure ViT will remain in research and multimodal systems, but most deployed vision models will use a CNN stem (for efficiency) followed by a Transformer encoder (for global context). Google's MaxViT and Microsoft's FocalNet are early examples.

2. ViT will become the default for generative vision. Image generation (Stable Diffusion, DALL·E 3), video generation (Sora, Gen-3), and 3D generation all use ViT-based encoders. This trend will accelerate as diffusion models and autoregressive vision models converge.

3. The data efficiency gap will shrink. Techniques like masked autoencoding (MAE), data augmentation, and synthetic data generation will reduce ViT's data requirements by 10x within two years. Meta's DINOv2 already achieves competitive results with zero labeled data.

4. Hardware will make ViT cost-competitive. NVIDIA's next-generation Blackwell architecture and Google's TPU v6 will include specialized attention accelerators that bring ViT inference costs below CNN levels for high-throughput scenarios.

5. The biggest winner will be multimodal AI. ViT's ability to represent images as sequences is the key enabler for unified models that process text, images, video, and audio in a single architecture. Google's Gemini, OpenAI's GPT-5, and Meta's next-generation models will all be built on ViT-like backbones.

What to watch next: The open-source community's response to ViT-22B (Google's 22-billion-parameter ViT) and the emergence of vision-only foundation models that rival CLIP in zero-shot performance. Also monitor the ongoing debate between pure Transformers and state-space models (Mamba, S4) for vision—the next paradigm shift may already be brewing.
