Undergraduates Build Full ML Stack from Scratch, Training a 12M-Parameter Transformer in Rust

Source: Hacker News · Archive: April 2026
In a remarkable four-month deep dive, two second-year computer science students built a fully functional machine learning framework from first principles. Using Rust and CUDA, they implemented core algorithms such as Flash Attention, built a custom training pipeline, and validated correctness through training.

A project undertaken by two undergraduate students is challenging conventional wisdom about how to learn and contribute to AI systems development. Rather than building atop established frameworks like PyTorch or TensorFlow, the duo embarked on a four-month journey to construct an entire machine learning stack from scratch. Their toolkit, written primarily in Rust, includes hand-coded CUDA kernels for performance-critical operations such as Flash Attention, layer normalization, and the AdamW optimizer. The framework's architecture is notably bifurcated: a high-performance Rust/CUDA backend for training and inference, paired with a strategically designed TypeScript API and WebGPU compute path. This WebGPU integration serves as both a practical fallback for environments without NVIDIA hardware and a forward-looking bet on portable, vendor-agnostic accelerated computing.

The project's capstone achievement was the successful training of a Transformer model with 12 million parameters, a non-trivial task that validates the framework's core correctness and stability. The significance lies not in displacing industrial tools but in the profound educational and philosophical statement it makes. In an era of increasing abstraction and "black box" AI tooling, this effort represents a deliberate return to first principles. By implementing every layer of the stack—from memory management and tensor operations to attention mechanisms and optimization—the developers gained an intimate, systems-level understanding of the performance bottlenecks, memory constraints, and engineering trade-offs that define modern AI. This "build the rocket engine to understand the rocket" approach is gaining traction as a potent method for cultivating the next generation of AI infrastructure architects and researchers who can innovate beyond incremental improvements on existing platforms.

Technical Deep Dive

The project's technical stack is a deliberate assembly of modern, performance-oriented, and forward-looking technologies. At its core is the Rust programming language, chosen for its unique blend of memory safety (via ownership and borrowing) and C/C++-level performance. This is critical for a framework that manages GPU memory, tensor allocations, and autograd graphs, where a single memory safety bug can lead to silent data corruption or crashes. The students implemented their own tensor library, autograd engine, and module system, avoiding dependencies on `ndarray` or `autograd` crates to maintain full control and understanding.
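To make the autograd idea concrete, here is a minimal scalar reverse-mode engine in Rust, in the spirit of what the students describe (all names here are illustrative, not the project's actual API). Each `Value` records its parents along with the local derivatives captured at construction time; `backward()` topologically orders the graph so every node's gradient is complete before being propagated to its parents:

```rust
use std::cell::RefCell;
use std::rc::Rc;

// A node in the autograd graph: value, accumulated gradient, and parents
// paired with the local derivative d(this)/d(parent).
struct Node {
    data: f64,
    grad: f64,
    parents: Vec<(Value, f64)>,
}

#[derive(Clone)]
struct Value(Rc<RefCell<Node>>);

impl Value {
    fn new(data: f64) -> Self {
        Value(Rc::new(RefCell::new(Node { data, grad: 0.0, parents: vec![] })))
    }
    fn data(&self) -> f64 { self.0.borrow().data }
    fn grad(&self) -> f64 { self.0.borrow().grad }

    fn add(&self, other: &Value) -> Value {
        let out = Value::new(self.data() + other.data());
        // d(a+b)/da = 1, d(a+b)/db = 1
        out.0.borrow_mut().parents = vec![(self.clone(), 1.0), (other.clone(), 1.0)];
        out
    }
    fn mul(&self, other: &Value) -> Value {
        let out = Value::new(self.data() * other.data());
        // d(a*b)/da = b, d(a*b)/db = a
        out.0.borrow_mut().parents =
            vec![(self.clone(), other.data()), (other.clone(), self.data())];
        out
    }

    fn backward(&self) {
        // Topological order via DFS, deduplicated by node pointer identity.
        let mut topo: Vec<Value> = Vec::new();
        let mut seen: Vec<*const RefCell<Node>> = Vec::new();
        fn build(v: &Value, topo: &mut Vec<Value>, seen: &mut Vec<*const RefCell<Node>>) {
            let ptr = Rc::as_ptr(&v.0);
            if seen.contains(&ptr) { return; }
            seen.push(ptr);
            let parents = v.0.borrow().parents.clone();
            for (p, _) in &parents {
                build(p, topo, seen);
            }
            topo.push(v.clone());
        }
        build(self, &mut topo, &mut seen);

        // Reverse sweep: seed d(out)/d(out) = 1, then push grads to parents.
        self.0.borrow_mut().grad = 1.0;
        for v in topo.iter().rev() {
            let g = v.0.borrow().grad;
            let parents = v.0.borrow().parents.clone();
            for (p, local) in &parents {
                p.0.borrow_mut().grad += local * g;
            }
        }
    }
}
```

For `f = x*y + x` with `x = 3, y = 4`, this yields `f = 15`, `df/dx = y + 1 = 5`, and `df/dy = x = 3`. A real tensor engine generalizes exactly this bookkeeping to n-dimensional arrays and GPU buffers.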

The most impressive technical feats are the custom CUDA kernels. Writing production-grade CUDA code is notoriously difficult, requiring deep knowledge of GPU architecture, memory hierarchies, and parallel programming paradigms. The team implemented:
1. A Flash Attention kernel: This is the state-of-the-art algorithm for computing attention in Transformers, optimizing for IO-awareness between GPU memory hierarchies (HBM vs. SRAM). Their implementation likely follows the seminal paper from Tri Dao, aiming to achieve near-theoretical FLOP utilization for attention blocks.
2. Layer Normalization & AdamW kernels: These are standard but performance-sensitive components. Fusing operations (e.g., normalization with a subsequent residual add) into single kernels reduces memory bandwidth pressure and kernel launch overhead.
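The IO-awareness mentioned above rests on the online-softmax trick: attention can be computed over tiles of keys and values using a running maximum and normalizer that are rescaled as new scores arrive, so the full score row is never materialized. A CPU sketch of that numerical core for a single query vector (hypothetical function names; the actual kernel operates on SRAM tiles in CUDA):

```rust
// Streaming softmax-weighted sum over tiled keys/values: the numerical
// core of Flash Attention, shown on CPU for one query.
fn streaming_attention(q: &[f32], keys: &[Vec<f32>], values: &[Vec<f32>], tile: usize) -> Vec<f32> {
    let d = q.len();
    let mut m = f32::NEG_INFINITY; // running max of scores seen so far
    let mut l = 0.0f32;            // running softmax normalizer
    let mut acc = vec![0.0f32; d]; // unnormalized output accumulator

    for start in (0..keys.len()).step_by(tile) {
        let end = (start + tile).min(keys.len());
        for i in start..end {
            // Attention score q · k_i (scaling by 1/sqrt(d) omitted for brevity).
            let score: f32 = q.iter().zip(&keys[i]).map(|(a, b)| a * b).sum();
            let m_new = m.max(score);
            let correction = (m - m_new).exp(); // rescale previously accumulated state
            let p = (score - m_new).exp();
            l = l * correction + p;
            for (a, v) in acc.iter_mut().zip(&values[i]) {
                *a = *a * correction + p * v;
            }
            m = m_new;
        }
    }
    // Final normalization by the softmax denominator.
    acc.iter().map(|a| a / l).collect()
}
```

The rescaling by `exp(m_old - m_new)` is what lets the kernel process tiles in any order while remaining numerically identical to the full softmax, which is precisely why the GPU version can keep working tiles in fast SRAM.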

A strategically novel component is the dual-backend design. Alongside the primary Rust/CUDA path, they built a WebGPU backend. WebGPU is an emerging web standard providing low-level, cross-platform access to GPU hardware (via Vulkan, Metal, or DirectX 12). This allows the framework to run in browsers or Node.js environments without CUDA drivers. The TypeScript API acts as a bridge, making the framework's capabilities accessible to the vast JavaScript/TypeScript web development ecosystem for inference, fine-tuning, or even small-scale training directly in the browser.
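In Rust, this kind of dual-backend design maps naturally onto a trait: callers program against one interface while CUDA, WebGPU, or CPU implementations are swapped in at runtime. A sketch with hypothetical names, not the project's actual API, showing only a CPU reference backend:

```rust
// One interface, many backends. Real backends would take typed tensor
// handles and device buffers instead of raw f32 slices.
trait Backend {
    fn name(&self) -> &'static str;
    // Row-major (m x k) * (k x n) matrix multiply.
    fn matmul(&self, a: &[f32], b: &[f32], m: usize, k: usize, n: usize) -> Vec<f32>;
}

// Stand-in for the CUDA and WebGPU implementations.
struct CpuReference;

impl Backend for CpuReference {
    fn name(&self) -> &'static str { "cpu-reference" }
    fn matmul(&self, a: &[f32], b: &[f32], m: usize, k: usize, n: usize) -> Vec<f32> {
        let mut out = vec![0.0f32; m * n];
        for i in 0..m {
            for j in 0..n {
                for p in 0..k {
                    out[i * n + j] += a[i * k + p] * b[p * n + j];
                }
            }
        }
        out
    }
}

// Runtime dispatch: probe for CUDA, fall back to WebGPU (or CPU here).
fn select_backend(prefer_cuda: bool) -> Box<dyn Backend> {
    if prefer_cuda {
        // e.g. return Box::new(CudaBackend::new()) when a driver is found
    }
    Box::new(CpuReference)
}
```

Dispatching through `Box<dyn Backend>` is the same shape ONNX Runtime uses for its execution providers: the API surface stays fixed while the compute path degrades gracefully on machines without NVIDIA hardware.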

| Component | Implementation Language | Key Innovation | Performance Target |
|---|---|---|---|
| Core Tensor & Autograd | Rust (no_std possible) | Manual memory management, custom ops | CPU-bound pre/post processing |
| Training Kernels (e.g., AdamW, LayerNorm) | Rust + CUDA | Fused operations, optimized memory access | Maximize GPU compute utilization |
| Attention Kernel | Rust + CUDA | Flash Attention v2 implementation | IO-bound optimization for attention |
| Inference Runtime | Rust + CUDA / WebGPU | Dual-backend, single API | Low-latency deployment across platforms |
| Language Binding | TypeScript/JavaScript (via wasm or FFI) | First-class API for web developers | Democratize model access |

Data Takeaway: The architecture table reveals a conscious separation of concerns and a polyglot strategy. The high-performance core is isolated in Rust/CUDA, while the accessibility layer is in TypeScript/WebGPU. This mirrors industry trends (like ONNX Runtime's multiple execution providers) but at a much more integrated and educational level.

Key Players & Case Studies

This project exists within a broader context of individuals and organizations pushing for deeper system understanding and more efficient, portable AI stacks.

Educational Pioneers: The project is spiritually aligned with educational initiatives like Andrej Karpathy's `micrograd` and `nanoGPT`, which demonstrate neural network fundamentals in minimal code. However, this undergraduate project operates at a significantly lower level, dealing with GPU kernels rather than Python NumPy. Jeremy Howard and Rachel Thomas of fast.ai have long advocated for a "bottom-up" learning approach, though their curriculum typically starts higher in the stack than CUDA kernel programming.

Industry & Open Source Parallels: While not direct competitors, several projects share philosophical or technical overlap:
1. `candle` by Hugging Face: a minimalist ML framework in Rust focused on performance and serverless inference. The undergraduate project reads like a from-scratch, educational counterpart to `candle`, down to the WebGPU target.
2. `llama.cpp` by Georgi Gerganov: a port of Meta's LLaMA model to pure C/C++, originally targeting CPU inference. It demonstrates the power and performance possible by stripping away large-framework overheads, a principle the students applied across their entire stack.
3. Google's JAX and XLA: While monolithic, JAX's design of composable function transformations and XLA's compiler-based optimizations represent the industrial end-state of thinking deeply about computation graphs. The students' autograd engine is a tiny step toward this world.

| Project | Primary Language | Focus | Key Differentiator |
|---|---|---|---|
| This Undergrad Project | Rust | Education / Full-stack understanding | From-scratch CUDA kernels, WebGPU fallback |
| PyTorch | C++ / Python | Industrial Research & Production | Dynamic graphs, massive ecosystem |
| `candle` (Hugging Face) | Rust | Efficient Inference | No Python, small footprint, WebAssembly support |
| `llama.cpp` | C/C++ | LLM Inference on CPU | Quantization, minimal dependencies, wide hardware support |
| JAX | Python / C++ | Composable Transformations | `jit`, `vmap`, `pmap`, XLA compiler backend |

Data Takeaway: The comparison shows the undergraduate project occupies a unique niche: it is more systems-focused and educational than PyTorch/JAX, more comprehensive (including training) than `llama.cpp`, and more experimental with its dual-backend design than the current `candle`. Its value is as a complete reference implementation.

Industry Impact & Market Dynamics

The immediate commercial impact of a single undergraduate project is negligible. However, it exemplifies and accelerates several critical trends with substantial long-term market implications.

1. The Rise of the Rust AI Stack: Rust is gaining serious momentum in AI infrastructure due to its safety and performance. Companies like Microsoft are exploring it for secure AI components, and Meta has used it in PyTorch for core libraries. This project adds to the growing body of proof that Rust is viable for the entire ML pipeline, from kernel to API. This could gradually erode the dominance of C++ in high-performance AI backends.

2. WebGPU as a Disruptive Force: NVIDIA's CUDA ecosystem enjoys a near-monopoly on AI training. WebGPU, backed by Apple, Google, Mozilla, and others, presents the first credible, open-standard alternative for portable accelerated compute. By integrating WebGPU as a first-class citizen, this project highlights a potential future where AI models are trained and deployed on any GPU (AMD, Intel, Apple Silicon) via a single codebase. This aligns with Apple's push for Core ML and Intel's oneAPI, aiming to break CUDA's lock-in.

3. Democratization and Edge AI: The TypeScript API and WebGPU backend directly serve the trend of deploying smaller, efficient models at the edge—in browsers, on mobile devices, or in serverless functions. The market for edge AI hardware and software is projected to grow exponentially, and frameworks that bridge the gap between high-performance training and flexible deployment will be crucial.

| Market Segment | 2024 Est. Size | 2029 Projection | CAGR | Key Driver |
|---|---|---|---|---|
| AI Infrastructure Software | $25B | $80B | ~26% | Enterprise AI adoption, scaling needs |
| Edge AI Software | $12B | $40B | ~27% | IoT, real-time inference, privacy |
| AI Developer Tools & Frameworks | $8B | $25B | ~25% | Growing developer base, specialization |
| Alternative AI Chip Market (non-NVIDIA) | $10B | $50B+ | ~38% | Supply constraints, cost, specialization (e.g., Groq, Cerebras, SambaNova) |

Data Takeaway: The high growth rates, especially in Edge AI and alternative chips, create a fertile ground for new software stacks. A framework designed for portability and efficiency from the ground up, as demonstrated in this project, is strategically positioned for these high-growth areas, unlike legacy frameworks burdened with historical design choices.

Risks, Limitations & Open Questions

Despite its brilliance as an educational endeavor, the project faces significant hurdles before it could be considered for practical use.

Technical Limitations: The framework is undoubtedly narrow in scope compared to PyTorch. It lacks:
* Distributed training capabilities (multi-GPU, multi-node).
* A comprehensive operator library (convolutional layers, RNNs, specialized attention variants).
* Robust tooling for profiling, debugging, and visualization.
* Quantization support, pruning, and other model optimization techniques crucial for deployment.
* The ecosystem of pre-trained models, datasets, and community contributions.
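To illustrate the deployment gap, absmax int8 quantization, one of the simpler techniques on that missing list, fits in a few lines (a generic sketch, not code from the project):

```rust
// Absmax int8 quantization: map a tensor to i8 with a single f32 scale
// chosen so the largest magnitude lands on 127.
fn quantize_absmax(x: &[f32]) -> (Vec<i8>, f32) {
    let absmax = x.iter().fold(0.0f32, |m, v| m.max(v.abs()));
    let scale = if absmax == 0.0 { 1.0 } else { absmax / 127.0 };
    let q = x.iter().map(|v| (v / scale).round() as i8).collect();
    (q, scale)
}

// Recover an approximation of the original values.
fn dequantize(q: &[i8], scale: f32) -> Vec<f32> {
    q.iter().map(|&v| v as f32 * scale).collect()
}
```

Production frameworks layer far more on top of this (per-channel scales, calibration, quantization-aware training), which is exactly the tooling a two-person codebase cannot realistically maintain.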

Performance Uncertainties: While custom kernels can be optimal, they require immense expertise to match or beat the years of optimization poured into libraries like cuDNN, cuBLAS, and TensorRT. The students' Flash Attention kernel, while correct, may not achieve the same peak performance as the official implementation from Tri Dao and Daniel Haziza.

Sustainability & Maintenance: This is a "two-person, four-month" codebase. Maintaining a production-grade framework is a full-time endeavor for large teams. The project risks becoming abandonware without ongoing commitment or institutional support.

Open Questions:
1. Can the educational value of "building from scratch" be systematically packaged and scaled? Could this become a new curriculum standard for top-tier CS/AI programs?
2. Is the dual-backend approach (CUDA/WebGPU) a maintainable long-term strategy, or does it double the development burden?
3. Does deep systems knowledge actually lead to more innovative AI research, or does it simply produce better engineers? The history of science suggests breakthrough ideas often come from understanding fundamentals.

AINews Verdict & Predictions

This undergraduate project is far more than a clever coding exercise; it is a compelling manifesto for a different kind of AI literacy. In a field obsessed with scaling parameters and chasing SOTA benchmarks, it argues that foundational knowledge of the computational substrate is not just valuable—it is essential for the next wave of innovation, particularly in efficiency and portability.

Our Predictions:
1. Educational Paradigm Shift: Within three years, we predict the emergence of formalized university courses and bootcamps centered on "Building a Deep Learning Framework" as a capstone project. These will use Rust and WebGPU as primary tools, creating a new cohort of developers with hybrid skills in systems programming and machine learning.
2. Niche Framework Proliferation: The success of `candle` and `llama.cpp` shows there is market appetite for lean, specialized frameworks. We will see more startups and open-source projects inspired by this from-scratch philosophy, focusing on specific niches like robotics (real-time control), on-device LLMs, or scientific AI, where PyTorch/TensorFlow are overkill.
3. WebGPU Becomes a Major AI Runtime: Within five years, WebGPU will be a standard deployment target for inference and light training, supported by all major ML frameworks. Projects like this one are the early prototypes. This will be accelerated by the growth of the alternative AI chip market, whose vendors will champion open standards over CUDA.
4. Corporate Acquisition of Talent: The students behind this project, and others like them, will become highly sought-after recruits for companies like NVIDIA, Intel, Google, and Apple, as well as AI infrastructure startups like Modular or Anyscale. Their skills are precisely those needed to build the next generation of AI systems.

The ultimate takeaway is that the future of AI tooling may not be built by incrementally improving the giants of today, but by a new generation of developers who, having learned by building the engine themselves, are unafraid to reimagine it entirely. This project is a seed; the industry should watch carefully for what grows from it.
