ds4, from the Creator of Redis, Brings DeepSeek 4 Flash to Apple Silicon with the Power of Metal

GitHub May 2026
⭐ 5,721 · 📈 +1,449
Source: GitHub · Topic: local AI · Archive: May 2026
Salvatore Sanfilippo (antirez), the creator of Redis, has released ds4, a compact inference engine for DeepSeek 4 Flash that uses Apple's Metal API for GPU acceleration on Macs. The project earned more than 1,400 stars in a single day, challenging the CUDA-centric AI ecosystem and opening up new possibilities.

In a move that bridges the worlds of systems programming and AI, antirez — the creator of Redis — has unveiled ds4, a dedicated inference engine for DeepSeek 4 Flash that runs entirely through Apple's Metal Performance Shaders. The project is a response to the growing demand for efficient local inference on consumer hardware, particularly for users who lack NVIDIA GPUs. ds4 is written in C and uses the Metal API to directly access the GPU on Apple Silicon Macs, bypassing the need for CUDA or any NVIDIA hardware.

The engine is designed to be lightweight (the binary is under 1 MB) and fast, with antirez reporting inference speeds of 50–60 tokens per second on an M1 Max. The project's GitHub repository exploded in popularity, reaching 5,700 stars within days, reflecting both the reputation of its creator and the hunger for practical local AI tools.

ds4 is not a general-purpose framework — it is specifically tuned for DeepSeek 4 Flash, a 16.1B-parameter mixture-of-experts model. This narrow focus allows for deep optimization, including custom kernel implementations for the model's unique architecture. The engine supports quantized inference (4-bit and 8-bit) and includes a simple chat interface. While the project is still in early stages, it represents a significant step toward democratizing AI inference on Apple hardware, a platform that has long been underserved by the CUDA-dominated AI toolchain.

Technical Deep Dive

antirez's ds4 is a masterclass in targeted optimization. Unlike general-purpose frameworks like llama.cpp or MLX, ds4 is purpose-built for a single model: DeepSeek 4 Flash. This allows it to exploit every architectural quirk of the model.

DeepSeek 4 Flash is a Mixture-of-Experts (MoE) model with 16.1 billion total parameters, but only about 2.2 billion are active per token. The model uses a fine-grained MoE structure with 64 experts and a top-2 routing mechanism. ds4 implements a custom Metal kernel for the sparse MoE computation, which is the core performance bottleneck. The kernel uses a block-sparse matrix multiplication approach, where only the activated expert weights are loaded from GPU memory. This reduces memory bandwidth usage by roughly 8x compared to a dense model of similar total size.
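The top-2 routing step described above can be sketched in a few lines of plain C. This is an illustrative sketch only, not code from the ds4 repository: in ds4 the equivalent selection runs inside a Metal kernel, and the function name here is invented.

```c
#include <stddef.h>

#define NUM_EXPERTS 64

/* Top-2 routing: pick the two experts with the highest router logits.
 * Hypothetical CPU-side sketch; ds4 does the equivalent on the GPU,
 * then loads only those two experts' weights for the matmul. */
static void top2_route(const float logits[NUM_EXPERTS], int *e0, int *e1) {
    int best = 0, second = -1;
    for (int i = 1; i < NUM_EXPERTS; i++)
        if (logits[i] > logits[best]) best = i;
    for (int i = 0; i < NUM_EXPERTS; i++) {
        if (i == best) continue;
        if (second < 0 || logits[i] > logits[second]) second = i;
    }
    *e0 = best;  /* highest-scoring expert */
    *e1 = second; /* runner-up */
}
```

Because only 2 of 64 experts are touched per token, only their weight blocks need to leave GPU memory, which is the source of the bandwidth savings the article describes.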

The engine also includes fused attention kernels that combine the Q, K, V projections and the scaled dot-product attention into a single Metal compute pass, minimizing kernel launch overhead. For quantization, ds4 supports both 4-bit and 8-bit GPTQ-style quantization, using a custom dequantization kernel that runs on-the-fly during inference. The project's codebase is remarkably small — under 2,000 lines of C — and is designed to be readable and hackable. antirez has stated that the goal is not to compete with frameworks but to demonstrate that efficient inference on Apple Silicon is possible with focused effort.
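On-the-fly dequantization of the kind described above can be illustrated in C. The packing layout below (two 4-bit weights per byte, with one float scale and zero point per group) is an assumption for illustration; ds4's actual weight format and Metal kernel are not documented in this article.

```c
#include <stdint.h>
#include <stddef.h>

/* Unpack n 4-bit quantized weights into floats, GPTQ-style:
 * dequantized = scale * (q - zero). Each byte holds two weights,
 * low nibble first. Hypothetical layout for illustration. */
static void dequant4(const uint8_t *packed, float scale, float zero,
                     float *out, size_t n) {
    for (size_t i = 0; i < n; i++) {
        uint8_t byte = packed[i / 2];
        uint8_t q = (i % 2 == 0) ? (byte & 0x0F) : (byte >> 4);
        out[i] = scale * ((float)q - zero);
    }
}
```

Running this "on the fly" inside the compute kernel, rather than materializing full-precision weights in memory, is what keeps the peak-memory numbers in the table below low.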

| Metric | ds4 (M1 Max, 4-bit) | llama.cpp (M1 Max, Q4_K_M) | MLX (M1 Max, 4-bit) |
|---|---|---|---|
| Model | DeepSeek 4 Flash | DeepSeek 4 Flash | DeepSeek 4 Flash |
| Tokens/sec | 55 | 38 | 42 |
| Peak Memory (GB) | 8.2 | 10.1 | 9.5 |
| Binary Size | 0.8 MB | 45 MB | 120 MB |
| Setup Time | Instant | 2-3 sec | 1-2 sec |

Data Takeaway: ds4 achieves a 45% speed improvement over llama.cpp and 31% over MLX on the same hardware, while using 19% less memory. This demonstrates the power of model-specific optimization over general-purpose frameworks.
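The takeaway's percentages follow directly from the table; a minimal C helper makes the arithmetic explicit (figures hard-coded from the benchmark table above):

```c
/* Relative improvement of measurement a over baseline b, in percent. */
static double pct_gain(double a, double b) {
    return 100.0 * (a - b) / b;
}

/* How much less a is than b, as a percentage of b. */
static double pct_less(double a, double b) {
    return 100.0 * (b - a) / b;
}
```

With the table's numbers: pct_gain(55, 38) ≈ 44.7% (reported as 45%), pct_gain(55, 42) ≈ 31.0%, and pct_less(8.2, 10.1) ≈ 18.8% (reported as 19%).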

The project uses Metal's `MTLResourceStorageModePrivate` for weights to keep them in GPU-accessible memory, and employs `dispatchThreadgroups` for fine-grained parallelism. The attention mechanism uses a tile-based approach with shared memory, reducing global memory reads. antirez also implemented a custom tokenizer in C, avoiding dependencies on Python or other runtimes.
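A dependency-free tokenizer of the sort mentioned above can be sketched as a greedy longest-match loop in C. The vocabulary and token IDs below are invented for illustration; the real DeepSeek tokenizer uses a much larger learned (BPE-style) vocabulary, and this function name does not come from ds4.

```c
#include <string.h>
#include <stddef.h>

/* Toy vocabulary; real tokenizers load tens of thousands of entries. */
static const char *vocab[] = { "hello", "hel", "he", "lo", "l", "o", "h", "e" };
#define VOCAB_LEN (sizeof(vocab) / sizeof(vocab[0]))

/* Greedy longest-match tokenization: at each position, emit the ID of
 * the longest vocabulary entry that matches, skipping unknown bytes. */
static size_t tokenize(const char *text, int *ids, size_t max_ids) {
    size_t count = 0, pos = 0, len = strlen(text);
    while (pos < len && count < max_ids) {
        int best = -1;
        size_t best_len = 0;
        for (size_t v = 0; v < VOCAB_LEN; v++) {
            size_t vl = strlen(vocab[v]);
            if (vl > best_len && strncmp(text + pos, vocab[v], vl) == 0) {
                best = (int)v;
                best_len = vl;
            }
        }
        if (best < 0) { pos++; continue; }  /* no match: skip this byte */
        ids[count++] = best;
        pos += best_len;
    }
    return count;
}
```

The appeal of a pure-C tokenizer, as in ds4, is that the whole pipeline stays in one small binary with no Python runtime to ship.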

Key Players & Case Studies

The most prominent figure here is Salvatore Sanfilippo (antirez), the creator of Redis. His involvement brings instant credibility and a large developer following. Redis itself is one of the most successful open-source projects, used by millions of developers worldwide. antirez's decision to focus on Apple Silicon is strategic — Apple's M-series chips have a unified memory architecture that is ideal for LLM inference, but the software ecosystem has lagged behind NVIDIA's CUDA. By writing a Metal-native engine, antirez is effectively building a bridge.

DeepSeek, the Chinese AI lab behind the DeepSeek 4 Flash model, is another key player. DeepSeek has positioned itself as a provider of efficient, open-weight models that rival larger proprietary systems. The company's focus on MoE architectures and quantization-friendly training makes it a natural fit for local inference. DeepSeek 4 Flash, released in early 2025, was specifically designed for deployment on consumer hardware, with a 16.1B total parameter count that fits within 16 GB of unified memory when quantized.

| Solution | Target Hardware | Model Support | Ease of Use | Performance |
|---|---|---|---|---|
| ds4 | Apple Silicon only | DeepSeek 4 Flash only | Very High (single binary) | Excellent (tuned) |
| llama.cpp | CPU, CUDA, Metal, Vulkan | Hundreds of models | High (many options) | Good (general) |
| MLX | Apple Silicon only | Many models (via conversion) | Medium (Python required) | Good (Apple-native) |
| Ollama | CPU, CUDA, Metal | Hundreds of models | Very High (CLI) | Good (wraps llama.cpp) |

Data Takeaway: ds4 trades broad compatibility for extreme performance on a single model. For developers who specifically need DeepSeek 4 Flash on a Mac, it is the fastest option available.

Other notable projects in this space include Apple's own MLX framework, which provides a NumPy-like API for machine learning on Apple Silicon, and the community-driven llama.cpp, which supports Metal via a backend. ds4's approach is distinct in its minimalism — it is a single-purpose tool that does one thing extremely well.

Industry Impact & Market Dynamics

The release of ds4 signals a broader shift in the AI inference landscape. For years, NVIDIA's CUDA has been the de facto standard for GPU-accelerated AI, creating a dependency that locks users into NVIDIA hardware. Apple Silicon, despite its impressive raw performance and unified memory, has been a second-class citizen in AI. ds4, along with MLX and the Metal backend in llama.cpp, is part of a wave of tools that are leveling the playing field.

The market for local AI inference is growing rapidly. According to industry estimates, the on-device AI market is projected to reach $50 billion by 2027, driven by privacy concerns, latency requirements, and the need for offline capabilities. Apple's installed base of over 200 million active Macs, the vast majority of which now run Apple Silicon (every Mac sold since the transition began in 2020 ships with it), represents a massive addressable market. However, the lack of easy-to-use, high-performance inference tools has been a bottleneck.

| Metric | Value |
|---|---|
| Apple Silicon Macs in use (2025) | ~200 million |
| Local AI inference market (2027 est.) | $50 billion |
| DeepSeek 4 Flash downloads (Hugging Face) | 1.2 million |
| ds4 GitHub stars (first week) | 5,700+ |

Data Takeaway: The combination of a large installed base of capable hardware and a popular open-weight model creates a strong tailwind for tools like ds4.

From a business perspective, ds4 is not a product but a proof of concept. antirez has stated he has no plans to commercialize it. However, it could influence Apple's own strategy. Apple has been investing heavily in on-device AI, with the A17 and M4 chips including dedicated Neural Engine cores. ds4 demonstrates that even without the Neural Engine, the GPU can be effectively harnessed for LLM inference. This may accelerate Apple's efforts to provide first-party AI inference tools, potentially integrating something similar into macOS.

Risks, Limitations & Open Questions

Despite its technical elegance, ds4 has significant limitations. The most obvious is its narrow model support — it only works with DeepSeek 4 Flash. While this allows for deep optimization, it also means that users who want to experiment with other models (e.g., Llama 3, Mistral, Qwen) are out of luck. The project is also tied to Apple Silicon; users on Intel Macs or non-Apple platforms cannot use it.

Another risk is maintenance. antirez is a prolific developer, but he has a history of moving on from projects once they reach a certain maturity (he stepped away from Redis in 2020). If ds4 is not picked up by the community, it could become abandonware. The codebase, while clean, is not extensively documented, which may hinder contributions.

There are also technical limitations. The current version does not support batching, streaming, or advanced features like speculative decoding or KV-cache quantization. The chat interface is minimal, lacking features like system prompts, multi-turn memory management, or tool use. For production use, users would likely need to wrap ds4 in a more comprehensive framework.

Ethically, the project raises questions about model provenance and security. DeepSeek 4 Flash is an open-weight model, but its training data and biases are not fully transparent. Running it locally mitigates some privacy concerns, but users should still be aware of potential biases or harmful outputs. The ds4 binary itself is small and auditable, but the model weights (several gigabytes) must be downloaded from Hugging Face or another source, introducing a supply chain risk.

AINews Verdict & Predictions

antirez's ds4 is a brilliant piece of engineering that serves as both a practical tool and a statement. It proves that with focused effort, Apple Silicon can match or exceed CUDA-based systems for specific inference workloads. The project's viral success — 5,700 stars in days — is a clear signal that the developer community is hungry for alternatives to the NVIDIA-CUDA monopoly.

Our predictions:

1. Antirez will not maintain ds4 long-term. He has a pattern of creating elegant, focused tools and then moving on. The project will likely be forked and maintained by the community, possibly evolving into a more general-purpose Metal inference engine.

2. Apple will take notice. The Cupertino giant has been slow to provide first-class AI inference tools for its hardware. ds4's performance numbers will be hard to ignore. Expect Apple to either acquire the project (unlikely) or release a similar first-party tool within 12 months.

3. Model-specific inference engines will proliferate. ds4's success will inspire similar projects for other popular models (Llama 3, Mistral, Qwen). We predict at least three such projects will emerge in the next six months, each targeting a specific model and hardware combination.

4. The CUDA moat is cracking. While NVIDIA will remain dominant in data centers, the consumer and edge AI market is becoming more diverse. Tools like ds4, MLX, and Vulkan-based engines are eroding the CUDA advantage on personal devices.

5. DeepSeek will benefit disproportionately. The DeepSeek 4 Flash model is now the go-to choice for Apple Silicon users who want fast local inference. This will drive more downloads, more community fine-tunes, and more ecosystem investment around DeepSeek's models.

What to watch next: The GitHub repository for ds4, where the community will either rally or fragment. Also watch for Apple's WWDC announcements — if they unveil a native AI inference SDK, ds4 will have been the catalyst.

