From Zero to GPT: Inside the Open-Source Book Teaching LLMs from Scratch

GitHub · May 2026
⭐ 92,867 📈 +401
Source: GitHub Archive, May 2026
A single GitHub repository has become the definitive hands-on guide to understanding large language models from first principles. With more than 92,000 stars, rasbt/llms-from-scratch offers a complete PyTorch-based curriculum for building a ChatGPT-like LLM, accompanied by a meticulously written companion book and source code.

The open-source project rasbt/llms-from-scratch, authored by Sebastian Raschka, has rapidly ascended to become one of the most starred AI education repositories on GitHub. It provides a step-by-step, code-first journey through building a large language model (LLM) similar in spirit to ChatGPT, using only PyTorch and no black-box libraries. The project is unique in its pedagogical rigor: it starts with raw text tokenization, walks through the complete Transformer architecture (multi-head self-attention, feed-forward layers, layer normalization), and then covers pretraining, instruction fine-tuning, and even alignment techniques like RLHF. A companion O'Reilly book, 'Build a Large Language Model (From Scratch)', provides the theoretical backbone. The project's significance lies in democratizing deep understanding of LLMs at a time when most practitioners treat them as opaque APIs. It is not a production framework but an educational masterpiece, and its popularity signals a hunger in the developer community for genuine comprehension over surface-level tool usage. The repository's star growth—over 400 new stars daily—reflects a global movement toward AI literacy that prioritizes fundamentals over hype.

Technical Deep Dive

rasbt/llms-from-scratch is not merely a code dump; it is a carefully sequenced curriculum that mirrors the historical and technical evolution of modern LLMs. The core architecture implemented is a decoder-only Transformer, the same family as GPT-2, GPT-3, and ChatGPT. The repository builds this from the ground up in pure PyTorch, avoiding high-level abstractions like Hugging Face Transformers until the final chapters.

Architecture Walkthrough:
- Tokenization: The project implements Byte-Pair Encoding (BPE) from scratch, demonstrating how raw text is converted into integer token IDs. This is a critical but often glossed-over step.
- Multi-Head Self-Attention: The heart of the Transformer. The code implements causal (masked) attention, scaled dot-product attention, and the concatenation of multiple attention heads. The explanation of the query, key, value projections is exceptionally clear.
- Layer Normalization & Feed-Forward Networks: Standard Transformer blocks are built with residual connections, layer norm (applied before each sub-layer, as in GPT-2), and a two-layer feed-forward network with GELU activation.
- Positional Embeddings: The project uses learned absolute positional embeddings, consistent with the original GPT architecture.
- Pretraining Objective: Causal language modeling (next-token prediction) on a text corpus. The code includes a training loop with cross-entropy loss, learning rate scheduling, and gradient clipping.
- Fine-tuning: The later chapters cover instruction fine-tuning (using a dataset of instruction-response pairs) and even a simplified version of RLHF (Reinforcement Learning from Human Feedback) using a reward model.
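The BPE idea behind the tokenization step can be illustrated with a toy merge loop: count adjacent token-ID pairs, then repeatedly replace the most frequent pair with a new ID. This is a minimal sketch of the general algorithm, not the repository's implementation:

```python
from collections import Counter

def most_frequent_pair(ids):
    """Count adjacent token-ID pairs and return the most frequent one."""
    pairs = Counter(zip(ids, ids[1:]))
    return max(pairs, key=pairs.get)

def merge(ids, pair, new_id):
    """Replace every occurrence of `pair` in `ids` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

# Start from raw UTF-8 bytes; each merge round shortens the sequence
# and grows the vocabulary by one new token.
text = "low lower lowest"
ids = list(text.encode("utf-8"))
for new_id in range(256, 259):  # three merge rounds
    pair = most_frequent_pair(ids)
    ids = merge(ids, pair, new_id)
print(ids)
```

Running the loop long enough on a large corpus yields the merge table that a real BPE tokenizer (such as GPT-2's) applies at encoding time.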
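The attention, normalization, and feed-forward pieces above compose into a pre-LayerNorm Transformer block. The following is a minimal sketch in the spirit of the repository's pure-PyTorch style — the class names and dimensions (`d_model=64`, 4 heads) are illustrative, not the book's actual code:

```python
import torch
import torch.nn as nn

class CausalSelfAttention(nn.Module):
    """Multi-head self-attention with a causal mask, GPT-2 style."""
    def __init__(self, d_model, n_heads, max_len):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)   # fused Q, K, V projections
        self.proj = nn.Linear(d_model, d_model)
        # Upper-triangular mask blocks attention to future positions
        mask = torch.triu(torch.ones(max_len, max_len), diagonal=1).bool()
        self.register_buffer("mask", mask)

    def forward(self, x):
        b, t, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Reshape each to (batch, heads, seq, head_dim)
        q, k, v = (z.view(b, t, self.n_heads, d // self.n_heads).transpose(1, 2)
                   for z in (q, k, v))
        att = (q @ k.transpose(-2, -1)) / (k.size(-1) ** 0.5)  # scaled dot product
        att = att.masked_fill(self.mask[:t, :t], float("-inf"))
        att = torch.softmax(att, dim=-1)
        out = (att @ v).transpose(1, 2).contiguous().view(b, t, d)  # concat heads
        return self.proj(out)

class TransformerBlock(nn.Module):
    """Pre-LayerNorm block: x + Attn(LN(x)), then x + FFN(LN(x))."""
    def __init__(self, d_model, n_heads, max_len):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.att = CausalSelfAttention(d_model, n_heads, max_len)
        self.ln2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),  # 4x expansion, as in GPT-2
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x):
        x = x + self.att(self.ln1(x))  # residual connection around attention
        x = x + self.ffn(self.ln2(x))  # residual connection around FFN
        return x

x = torch.randn(2, 16, 64)            # (batch, seq_len, d_model)
block = TransformerBlock(64, 4, 128)
print(block(x).shape)                  # torch.Size([2, 16, 64])
```

In a full GPT model, token embeddings plus learned absolute positional embeddings would feed a stack of these blocks, followed by a final layer norm and an output projection to vocabulary logits.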
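The pretraining objective reduces to a compact training step: shift the token IDs by one position to form targets, compute cross-entropy over the vocabulary, clip gradients, and step the optimizer and scheduler. A minimal illustration, not the repository's actual loop — the tiny embedding-plus-linear `model` is a hypothetical stand-in for a real GPT:

```python
import torch
import torch.nn.functional as F

# Toy stand-in for a GPT-style model: maps token IDs to vocab logits.
vocab_size, d_model = 100, 32
model = torch.nn.Sequential(
    torch.nn.Embedding(vocab_size, d_model),
    torch.nn.Linear(d_model, vocab_size),
)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=1000)

def train_step(input_ids):
    # Next-token prediction: targets are the inputs shifted left by one.
    inputs, targets = input_ids[:, :-1], input_ids[:, 1:]
    logits = model(inputs)                   # (batch, seq-1, vocab)
    loss = F.cross_entropy(
        logits.reshape(-1, vocab_size),      # flatten (batch*seq, vocab)
        targets.reshape(-1),
    )
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    scheduler.step()                         # learning-rate scheduling
    return loss.item()

batch = torch.randint(0, vocab_size, (4, 17))  # 4 sequences of 17 token IDs
print(train_step(batch))
```

At initialization the loss should sit near ln(vocab_size), since the untrained model's predictions are roughly uniform over the vocabulary.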

Key Engineering Decisions:
- The code is written for clarity, not maximum performance. It uses `nn.Module` subclasses, clear forward passes, and extensive comments. This makes it an ideal learning tool but not a production training script.
- The repository is version-controlled with tags corresponding to each chapter, allowing learners to check out the exact state of the code at any point.
- The companion book (O'Reilly, 2024) provides the mathematical derivations and conceptual explanations, while the code serves as the executable reference.

Comparison with Other Educational Repos:

| Repository | Stars (approx.) | Focus | Framework | Book Available? |
|---|---|---|---|---|
| rasbt/llms-from-scratch | 92,000+ | Full LLM pipeline from scratch | PyTorch | Yes (O'Reilly) |
| karpathy/nanoGPT | 38,000+ | Minimal GPT-2 training | PyTorch | No |
| huggingface/transformers | 130,000+ | Production-ready model zoo | PyTorch/TF/JAX | No |
| karpathy/llm.c | 25,000+ | GPT-2 in pure C | C/CUDA | No |

Data Takeaway: rasbt/llms-from-scratch has achieved nearly 2.5x the stars of nanoGPT, despite being newer. This suggests the combination of a structured book + code is more appealing to learners than a minimalist code-only approach.

Key Players & Case Studies

Sebastian Raschka (Author): Formerly an assistant professor of statistics at the University of Wisconsin–Madison and now a research engineer at Lightning AI, Raschka is a well-known figure in the PyTorch ecosystem. He is the author of the best-selling 'Python Machine Learning' and 'Machine Learning with PyTorch and Scikit-Learn'. His reputation for clear, practical explanations has made his educational materials highly trusted. rasbt/llms-from-scratch is his most ambitious project yet, and its success is a direct result of his established credibility.

Lightning AI (Affiliation): The company behind PyTorch Lightning, a popular framework for scaling PyTorch training. While the repository itself is framework-agnostic, Raschka's affiliation with Lightning AI gives the project a subtle but important ecosystem connection. Lightning AI benefits from the increased PyTorch literacy that the book promotes.

O'Reilly Media (Publisher): The decision to publish a physical book alongside the open-source code is a strategic move. It validates the content's quality and provides a revenue stream that supports ongoing maintenance. The book has been consistently in the top 10 on Amazon's AI/ML bestseller list since launch.

Comparison with Competing Educational Products:

| Product | Format | Price | Target Audience | Depth Level |
|---|---|---|---|---|
| rasbt/llms-from-scratch | GitHub + Book | Free (code) / ~$50 (book) | Intermediate ML engineers | High (from scratch) |
| fast.ai 'Practical Deep Learning' | Course + Book | Free | Beginners | Medium (top-down) |
| DeepLearning.AI 'Building Systems with ChatGPT' | Course | $49/month | Developers | Low (API usage) |
| Stanford CS224n | Course (videos + notes) | Free | Graduate students | Very High (theoretical) |

Data Takeaway: rasbt/llms-from-scratch occupies a unique sweet spot: it is more hands-on than Stanford's CS224n, more rigorous than fast.ai, and more fundamental than DeepLearning.AI's API-focused courses. This positioning is key to its viral growth.

Industry Impact & Market Dynamics

The explosive popularity of rasbt/llms-from-scratch is a leading indicator of a major shift in the AI talent market. As LLMs become commoditized via APIs, the competitive advantage for companies is shifting from 'who can call an API' to 'who can fine-tune, align, and deploy custom models efficiently.' This creates massive demand for engineers who understand the internals.

Market Data:
- The global AI education market is projected to grow from $1.5 billion in 2023 to $8.5 billion by 2030 (CAGR ~28%).
- Job postings requiring 'LLM fine-tuning' or 'Transformer architecture' skills have increased 340% year-over-year on LinkedIn.
- The number of GitHub repositories tagged with 'llm' or 'large-language-model' has grown from ~5,000 in 2022 to over 150,000 in 2025.

Impact on Hiring: Companies like Anthropic, OpenAI, and Mistral are increasingly hiring engineers who have built models from scratch, not just used them. The rasbt repository directly addresses this skills gap. It is now common for interviewers at top AI labs to ask candidates about the content of this book.

Impact on the Open-Source Ecosystem: The repository has spawned a cottage industry of derivative works: translated versions (Chinese, Japanese, Spanish), video walkthroughs on YouTube, and even university courses that adopt it as a textbook. This network effect amplifies its influence far beyond the original code.

Business Model Implications: The success of this project validates the 'open-core + premium book' model for AI education. It challenges the dominance of expensive bootcamps and university degrees, offering a high-quality, low-cost alternative. This could pressure traditional education providers to update their curricula or risk obsolescence.

Risks, Limitations & Open Questions

1. Oversimplification of Scale: The book's largest model is ~1.5 billion parameters (GPT-2 XL scale). While this is sufficient for learning, it does not expose learners to the challenges of distributed training, model parallelism, or the engineering required for models >10B parameters. There is a risk that learners believe they understand 'LLMs from scratch' but are unprepared for production-scale engineering.

2. Outdated Architecture: The book focuses on the GPT-2/GPT-3 architecture. It does not cover Mixture-of-Experts (MoE), Grouped-Query Attention (GQA), or Rotary Position Embeddings (RoPE), which are standard in modern models like Llama 3 and Mixtral. Learners may need to supplement with additional resources.

3. Computational Cost: While the code is designed to run on a single GPU, training even the 1.5B parameter model requires significant compute (days on an A100). This creates a barrier for learners without access to cloud credits or high-end hardware.

4. Ethical Considerations: The book covers RLHF but does not deeply explore alignment failures, jailbreaking, or the societal risks of LLMs. A purely technical education without ethical context could lead to irresponsible deployment.

5. Maintenance Burden: With 92K+ stars, the repository faces constant pressure to update. PyTorch versions change, new techniques emerge, and the community expects ongoing improvements. Raschka has been diligent, but this is a long-term commitment.

AINews Verdict & Predictions

Verdict: rasbt/llms-from-scratch is the single most important open-source educational resource for LLMs available today. It fills a critical gap between high-level API tutorials and impenetrable research papers. Its success is well-deserved and signals a maturation of the AI field where deep understanding is valued over hype.

Predictions:
1. Within 12 months, this repository will surpass 150,000 stars, making it one of the top 20 most-starred repositories on GitHub. The combination of a best-selling book and viral word-of-mouth will drive continued growth.
2. A second edition will be announced within 18 months, covering MoE, GQA, RoPE, and possibly multi-modal models (vision-language). The community demand will be overwhelming.
3. University adoption will accelerate. We predict at least 50 universities worldwide will adopt this book as a primary or supplementary textbook for their NLP/ML courses within two years, displacing older texts like 'Speech and Language Processing' (Jurafsky & Martin) for the practical component.
4. A 'production edition' spin-off will emerge, either from Raschka or a third party, that extends the code to distributed training (FSDP, DeepSpeed) and inference optimization (vLLM, TensorRT). This will be a natural next step for graduates of the original book.
5. The biggest risk is obsolescence. If a new architecture (e.g., State Space Models like Mamba) supplants Transformers, the book's value will diminish. However, the pedagogical approach is transferable, and Raschka is likely to adapt.

What to watch: The number of pull requests adding new chapters or modern techniques. If the community begins to fork and extend the repository faster than the author can merge, it will indicate that the original scope is no longer sufficient. For now, it remains the gold standard.
