Undergraduates Build Full ML Stack from Scratch, Training 12M-Parameter Transformer in Rust

Hacker News April 2026
After a remarkable four-month deep dive, two sophomore computer science students designed a fully working machine learning framework from first principles. Using Rust and CUDA, they implemented core algorithms such as Flash Attention, built a custom training pipeline, and then validated it all through training.

A project undertaken by two undergraduate students is challenging conventional wisdom about how to learn and contribute to AI systems development. Rather than building atop established frameworks like PyTorch or TensorFlow, the duo embarked on a four-month journey to construct an entire machine learning stack from scratch. Their toolkit, written primarily in Rust, includes hand-coded CUDA kernels for performance-critical operations such as Flash Attention, layer normalization, and the AdamW optimizer. The framework's architecture is notably bifurcated: a high-performance Rust/CUDA backend for training and inference, paired with a strategically designed TypeScript API and WebGPU compute path. This WebGPU integration serves as both a practical fallback for environments without NVIDIA hardware and a forward-looking bet on portable, vendor-agnostic accelerated computing.

The project's capstone achievement was the successful training of a Transformer model with 12 million parameters, a non-trivial task that validates the framework's core correctness and stability. The significance lies not in displacing industrial tools but in the profound educational and philosophical statement it makes. In an era of increasing abstraction and "black box" AI tooling, this effort represents a deliberate return to first principles. By implementing every layer of the stack—from memory management and tensor operations to attention mechanisms and optimization—the developers gained an intimate, systems-level understanding of the performance bottlenecks, memory constraints, and engineering trade-offs that define modern AI. This "build the rocket engine to understand the rocket" approach is gaining traction as a potent method for cultivating the next generation of AI infrastructure architects and researchers who can innovate beyond incremental improvements on existing platforms.

Technical Deep Dive

The project's technical stack is a deliberate assembly of modern, performance-oriented, and forward-looking technologies. At its core is the Rust programming language, chosen for its unique blend of memory safety (via ownership and borrowing) and C/C++-level performance. This is critical for a framework that manages GPU memory, tensor allocations, and autograd graphs, where a single memory safety bug can lead to silent data corruption or crashes. The students implemented their own tensor library, autograd engine, and module system, avoiding dependencies on `ndarray` or `autograd` crates to maintain full control and understanding.
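To give a sense of what "no `ndarray`, no `autograd` crate" entails, here is a minimal sketch of a scalar reverse-mode autograd engine in plain Rust. It is illustrative only: the `Value` type, its ops, and the topological backward pass are assumptions about the general shape of such an engine, not the students' actual API.

```rust
use std::cell::RefCell;
use std::collections::HashSet;
use std::rc::Rc;

// One scalar node in the computation graph. A real engine stores
// tensors backed by device buffers instead of f64 values.
struct Node {
    data: f64,
    grad: f64,
    // Parent values and the local gradient d(this)/d(parent).
    parents: Vec<(Value, f64)>,
}

#[derive(Clone)]
struct Value(Rc<RefCell<Node>>);

impl Value {
    fn new(data: f64) -> Self {
        Value(Rc::new(RefCell::new(Node { data, grad: 0.0, parents: Vec::new() })))
    }
    fn data(&self) -> f64 { self.0.borrow().data }
    fn grad(&self) -> f64 { self.0.borrow().grad }

    fn add(&self, other: &Value) -> Value {
        let out = Value::new(self.data() + other.data());
        out.0.borrow_mut().parents = vec![(self.clone(), 1.0), (other.clone(), 1.0)];
        out
    }

    fn mul(&self, other: &Value) -> Value {
        let out = Value::new(self.data() * other.data());
        out.0.borrow_mut().parents =
            vec![(self.clone(), other.data()), (other.clone(), self.data())];
        out
    }

    // Reverse-mode sweep: topologically order the graph, seed
    // d(out)/d(out) = 1, then push gradients to parents via the chain rule.
    fn backward(&self) {
        fn visit(v: &Value, seen: &mut HashSet<usize>, topo: &mut Vec<Value>) {
            if seen.insert(Rc::as_ptr(&v.0) as usize) {
                for (p, _) in v.0.borrow().parents.iter() {
                    visit(p, seen, topo);
                }
                topo.push(v.clone());
            }
        }
        let mut topo = Vec::new();
        visit(self, &mut HashSet::new(), &mut topo);
        self.0.borrow_mut().grad = 1.0;
        for v in topo.iter().rev() {
            let g = v.0.borrow().grad;
            for (p, local) in v.0.borrow().parents.clone() {
                p.0.borrow_mut().grad += g * local;
            }
        }
    }
}

fn main() {
    // f(x, y) = x*y + x  =>  df/dx = y + 1, df/dy = x
    let x = Value::new(3.0);
    let y = Value::new(4.0);
    let f = x.mul(&y).add(&x);
    f.backward();
    println!("f = {}, df/dx = {}, df/dy = {}", f.data(), x.grad(), y.grad());
}
```

The `Rc<RefCell<_>>` graph is where Rust's ownership model earns its keep: shared, mutable graph nodes are exactly the pattern where C++ frameworks historically leaked or corrupted memory.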

The most impressive technical feats are the custom CUDA kernels. Writing production-grade CUDA code is notoriously difficult, requiring deep knowledge of GPU architecture, memory hierarchies, and parallel programming paradigms. The team implemented:
1. A Flash Attention kernel: This is the state-of-the-art algorithm for computing attention in Transformers, optimizing for IO-awareness between GPU memory hierarchies (HBM vs. SRAM). Their implementation likely follows the seminal paper from Tri Dao, aiming to achieve near-theoretical FLOP utilization for attention blocks.
2. Layer Normalization & AdamW kernels: These are standard but performance-sensitive components. Fusing operations (e.g., normalization with a subsequent residual add) into single kernels reduces memory bandwidth pressure and kernel launch overhead.
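The IO-awareness in item 1 hinges on the online-softmax recurrence: scores are processed tile by tile while a running max, denominator, and accumulator are rescaled, so the full score row is never materialized. A plain-Rust sketch of that recurrence for a single query vector (illustrative only; the students' actual kernel is CUDA and operates on whole tiles in shared memory):

```rust
// Online-softmax attention for one query, processing keys/values in
// tiles. Running max `m`, denominator `l`, and the unnormalized
// accumulator are rescaled as each tile arrives -- the trick that lets
// Flash Attention keep tiles in SRAM instead of round-tripping the
// full score matrix through HBM.
fn attention_online(q: &[f64], keys: &[Vec<f64>], values: &[Vec<f64>], tile: usize) -> Vec<f64> {
    let d = q.len() as f64;
    let mut m = f64::NEG_INFINITY; // running max of scores
    let mut l = 0.0; // running softmax denominator
    let mut acc = vec![0.0; values[0].len()]; // unnormalized output
    for (kt, vt) in keys.chunks(tile).zip(values.chunks(tile)) {
        // Scaled dot-product scores for this tile.
        let s: Vec<f64> = kt
            .iter()
            .map(|k| q.iter().zip(k).map(|(a, b)| a * b).sum::<f64>() / d.sqrt())
            .collect();
        let m_new = s.iter().cloned().fold(m, f64::max);
        let scale = (m - m_new).exp(); // rescale previously accumulated state
        l *= scale;
        for a in acc.iter_mut() {
            *a *= scale;
        }
        for (si, v) in s.iter().zip(vt) {
            let p = (si - m_new).exp();
            l += p;
            for (a, vi) in acc.iter_mut().zip(v) {
                *a += p * vi;
            }
        }
        m = m_new;
    }
    acc.iter().map(|a| a / l).collect()
}

fn main() {
    let q = vec![1.0, 0.0];
    let keys = vec![vec![1.0, 0.0], vec![0.0, 1.0], vec![1.0, 1.0]];
    let values = vec![vec![1.0, 2.0], vec![3.0, 4.0], vec![5.0, 6.0]];
    // Identical to naive softmax(qK^T / sqrt(d)) V, but computed in tiles.
    println!("{:?}", attention_online(&q, &keys, &values, 2));
}
```

Because the rescaling is exact, the result is independent of tile size — which is also what makes such kernels testable against a naive reference implementation.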

A strategically novel component is the dual-backend design. Alongside the primary Rust/CUDA path, they built a WebGPU backend. WebGPU is an emerging web standard providing low-level, cross-platform access to GPU hardware (on top of Vulkan, Metal, and DirectX 12). This allows the framework to run in browsers or Node.js environments without CUDA drivers. The TypeScript API acts as a bridge, making the framework's capabilities accessible to the vast JavaScript/TypeScript web development ecosystem for inference, fine-tuning, or even small-scale training directly in the browser.
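One plausible shape for such a dual-backend design is a backend trait behind a single user-facing API. The trait, the `axpy` primitive, and the CPU stand-in below are hypothetical illustrations of the pattern, not the project's real interface:

```rust
// A single API over swappable backends. The CUDA and WebGPU paths
// would be further impls of this trait; here a CPU reference backend
// stands in for both.
trait Backend {
    fn name(&self) -> &'static str;
    // Representative primitive: y <- a*x + y. Each backend supplies
    // its own kernel for ops like this.
    fn axpy(&self, a: f32, x: &[f32], y: &mut [f32]);
}

struct CpuBackend; // stand-in for a CudaBackend / WebGpuBackend

impl Backend for CpuBackend {
    fn name(&self) -> &'static str { "cpu-reference" }
    fn axpy(&self, a: f32, x: &[f32], y: &mut [f32]) {
        for (yi, xi) in y.iter_mut().zip(x) {
            *yi += a * xi;
        }
    }
}

// Runtime selection: a real implementation would probe for CUDA,
// fall back to WebGPU, then CPU. Driver probing is elided here.
fn select_backend(_prefer_cuda: bool) -> Box<dyn Backend> {
    Box::new(CpuBackend)
}

fn main() {
    let backend = select_backend(true);
    let x = [1.0f32, 2.0, 3.0];
    let mut y = [10.0f32, 10.0, 10.0];
    backend.axpy(2.0, &x, &mut y);
    println!("{}: {:?}", backend.name(), y); // [12.0, 14.0, 16.0]
}
```

The design cost is that every new op must be implemented (and kept numerically consistent) once per backend — a tension the article returns to in its open questions.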

| Component | Implementation Language | Key Innovation | Performance Target |
|---|---|---|---|
| Core Tensor & Autograd | Rust (no_std possible) | Manual memory management, custom ops | CPU-bound pre/post processing |
| Training Kernels (e.g., AdamW, LayerNorm) | Rust + CUDA | Fused operations, optimized memory access | Maximize GPU compute utilization |
| Attention Kernel | Rust + CUDA | Flash Attention v2 implementation | IO-bound optimization for attention |
| Inference Runtime | Rust + CUDA / WebGPU | Dual-backend, single API | Low-latency deployment across platforms |
| Language Binding | TypeScript/JavaScript (via wasm or FFI) | First-class API for web developers | Democratize model access |

Data Takeaway: The architecture table reveals a conscious separation of concerns and a polyglot strategy. The high-performance core is isolated in Rust/CUDA, while the accessibility layer is in TypeScript/WebGPU. This mirrors industry trends (like ONNX Runtime's multiple execution providers) but at a much more integrated and educational level.

Key Players & Case Studies

This project exists within a broader context of individuals and organizations pushing for deeper system understanding and more efficient, portable AI stacks.

Educational Pioneers: The project is spiritually aligned with educational initiatives like Andrej Karpathy's `micrograd` and `nanoGPT`, which demonstrate neural network fundamentals in minimal code. However, this undergraduate project operates at a significantly lower level, dealing with GPU kernels rather than Python NumPy. Jeremy Howard and Rachel Thomas of fast.ai have long advocated for a "bottom-up" learning approach, though their curriculum typically starts higher in the stack than CUDA kernel programming.

Industry & Open Source Parallels: While not direct competitors, several projects share philosophical or technical overlap:
1. `candle` by Hugging Face: A minimalist ML framework in Rust, with a focus on performance and serverless inference. The undergraduate project reads like an educational, from-scratch counterpart to `candle`, including the WebGPU target.
2. `llama.cpp` by Georgi Gerganov: A pure C/C++ implementation of inference for Meta's LLaMA models, originally focused on CPUs. It demonstrates the power and performance possible by stripping away large framework overheads, a principle the students applied across their entire stack.
3. Google's JAX and XLA: While monolithic, JAX's design of composable function transformations and XLA's compiler-based optimizations represent the industrial end-state of thinking deeply about computation graphs. The students' autograd engine is a tiny step toward this world.

| Project | Primary Language | Focus | Key Differentiator |
|---|---|---|---|
| This Undergrad Project | Rust | Education / Full-stack understanding | From-scratch CUDA kernels, WebGPU fallback |
| PyTorch | C++ / Python | Industrial Research & Production | Dynamic graphs, massive ecosystem |
| `candle` (Hugging Face) | Rust | Efficient Inference | No Python, small footprint, WebAssembly support |
| `llama.cpp` | C/C++ | LLM Inference on CPU | Quantization, minimal dependencies, wide hardware support |
| JAX | Python / C++ | Composable Transformations | `jit`, `vmap`, `pmap`, XLA compiler backend |

Data Takeaway: The comparison shows the undergraduate project occupies a unique niche: it is more systems-focused and educational than PyTorch/JAX, more comprehensive (including training) than `llama.cpp`, and more experimental with its dual-backend design than the current `candle`. Its value is as a complete reference implementation.

Industry Impact & Market Dynamics

The immediate commercial impact of a single undergraduate project is negligible. However, it exemplifies and accelerates several critical trends with substantial long-term market implications.

1. The Rise of the Rust AI Stack: Rust is gaining serious momentum in AI infrastructure due to its safety and performance. Companies like Microsoft are exploring it for secure AI components, and Meta has used it in PyTorch for core libraries. This project adds to the growing body of proof that Rust is viable for the entire ML pipeline, from kernel to API. This could gradually erode the dominance of C++ in high-performance AI backends.

2. WebGPU as a Disruptive Force: NVIDIA's CUDA ecosystem enjoys a near-monopoly on AI training. WebGPU, backed by Apple, Google, Mozilla, and others, presents the first credible, open-standard alternative for portable accelerated compute. By integrating WebGPU as a first-class citizen, this project highlights a potential future where AI models are trained and deployed on any GPU (AMD, Intel, Apple Silicon) via a single codebase. This aligns with Apple's push for Core ML and Intel's oneAPI, aiming to break CUDA's lock-in.

3. Democratization and Edge AI: The TypeScript API and WebGPU backend directly serve the trend of deploying smaller, efficient models at the edge—in browsers, on mobile devices, or in serverless functions. The market for edge AI hardware and software is projected to grow exponentially, and frameworks that bridge the gap between high-performance training and flexible deployment will be crucial.

| Market Segment | 2024 Est. Size | 2029 Projection | CAGR | Key Driver |
|---|---|---|---|---|
| AI Infrastructure Software | $25B | $80B | ~26% | Enterprise AI adoption, scaling needs |
| Edge AI Software | $12B | $40B | ~27% | IoT, real-time inference, privacy |
| AI Developer Tools & Frameworks | $8B | $25B | ~25% | Growing developer base, specialization |
| Alternative AI Chip Market (non-NVIDIA) | $10B | $50B+ | ~38% | Supply constraints, cost, specialization (e.g., Groq, Cerebras, SambaNova) |

Data Takeaway: The high growth rates, especially in Edge AI and alternative chips, create a fertile ground for new software stacks. A framework designed for portability and efficiency from the ground up, as demonstrated in this project, is strategically positioned for these high-growth areas, unlike legacy frameworks burdened with historical design choices.

Risks, Limitations & Open Questions

Despite its brilliance as an educational endeavor, the project faces significant hurdles before it could be considered for practical use.

Technical Limitations: The framework is far narrower in scope than PyTorch. It lacks:
* Distributed training capabilities (multi-GPU, multi-node).
* A comprehensive operator library (convolutional layers, RNNs, specialized attention variants).
* Robust tooling for profiling, debugging, and visualization.
* Quantization support, pruning, and other model optimization techniques crucial for deployment.
* The ecosystem of pre-trained models, datasets, and community contributions.

Performance Uncertainties: While custom kernels can be optimal, they require immense expertise to match or beat the years of optimization poured into libraries like cuDNN, cuBLAS, and TensorRT. The students' Flash Attention kernel, while correct, may not achieve the same peak performance as the official implementation from Tri Dao and Daniel Haziza.

Sustainability & Maintenance: This is a "two-person, four-month" codebase. Maintaining a production-grade framework is a full-time endeavor for large teams. The project risks becoming abandonware without ongoing commitment or institutional support.

Open Questions:
1. Can the educational value of "building from scratch" be systematically packaged and scaled? Could this become a new curriculum standard for top-tier CS/AI programs?
2. Is the dual-backend approach (CUDA/WebGPU) a maintainable long-term strategy, or does it double the development burden?
3. Does deep systems knowledge actually lead to more innovative AI research, or does it simply produce better engineers? The history of science suggests breakthrough ideas often come from understanding fundamentals.

AINews Verdict & Predictions

This undergraduate project is far more than a clever coding exercise; it is a compelling manifesto for a different kind of AI literacy. In a field obsessed with scaling parameters and chasing SOTA benchmarks, it argues that foundational knowledge of the computational substrate is not just valuable—it is essential for the next wave of innovation, particularly in efficiency and portability.

Our Predictions:
1. Educational Paradigm Shift: Within three years, we predict the emergence of formalized university courses and bootcamps centered on "Building a Deep Learning Framework" as a capstone project. These will use Rust and WebGPU as primary tools, creating a new cohort of developers with hybrid skills in systems programming and machine learning.
2. Niche Framework Proliferation: The success of `candle` and `llama.cpp` shows there is market appetite for lean, specialized frameworks. We will see more startups and open-source projects inspired by this from-scratch philosophy, focusing on specific niches like robotics (real-time control), on-device LLMs, or scientific AI, where PyTorch/TensorFlow are overkill.
3. WebGPU Becomes a Major AI Runtime: Within five years, WebGPU will be a standard deployment target for inference and light training, supported by all major ML frameworks. Projects like this one are the early prototypes. This will be accelerated by the growth of the alternative AI chip market, whose vendors will champion open standards over CUDA.
4. Corporate Acquisition of Talent: The students behind this project, and others like them, will become highly sought-after recruits for companies like NVIDIA, Intel, Google, and Apple, as well as AI infrastructure startups like Modular or Anyscale. Their skills are precisely those needed to build the next generation of AI systems.

The ultimate takeaway is that the future of AI tooling may not be built by incrementally improving the giants of today, but by a new generation of developers who, having learned by building the engine themselves, are unafraid to reimagine it entirely. This project is a seed; the industry should watch carefully for what grows from it.

