Spiritual Spell Red Teaming: The Open Source Jailbreak Library Exposing Claude's Hidden Flaws

The open-source community has a new weapon in the AI safety arms race: Spiritual-Spell-Red-Teaming, a repository created by the pseudonymous developer goochbeater. The repo collects, categorizes, and actively develops adversarial prompts, often called 'spells', that exploit weaknesses in large language models, with a primary focus on Anthropic's Claude family. Unlike scattered forum posts or one-off jailbreaks, the project presents a structured taxonomy of attack vectors, including role-playing escapes, hypothetical framing, multi-turn manipulation, and encoding obfuscation. Its rapid growth, with 262 stars added in the last day alone, signals strong demand for systematic red teaming resources.

The significance is twofold: the repo gives security researchers a practical tool for stress-testing their own systems, and it publicly demonstrates that current alignment techniques, especially Claude's constitutional AI approach, remain vulnerable to carefully crafted inputs. Its limitations are equally important: most methods are prompt-level attacks that may be patched by model updates, and the project has not yet published quantitative success rates across different model versions. Nonetheless, it has already sparked debate about whether open-sourcing jailbreak techniques helps defenders more than attackers.

Technical Deep Dive

Spiritual-Spell-Red-Teaming is more than a list of prompts: it is a structured red teaming methodology. The repository organizes attacks into four categories: Contextual Override (forcing the model to adopt a persona that bypasses safety rules), Hypothetical Framing (asking 'what if' scenarios that implicitly violate policy), Encoding Obfuscation (using Base64, leetspeak, or Unicode tricks to hide malicious intent from the safety classifier), and Multi-Turn Extraction (building trust over several exchanges before revealing the harmful request).
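On the defensive side, the Encoding Obfuscation category suggests an obvious countermeasure: decode anything that looks encoded before a guardrail inspects the input. The following is a minimal sketch of that idea for Base64 only, not something taken from the repo. The function names `decode_base64_spans` and `expand_for_review` and the 16-character minimum are assumptions made for illustration.

```python
import base64
import binascii
import re

# Hypothetical heuristic: runs of 16+ Base64-alphabet characters, optional padding.
B64_TOKEN = re.compile(r"[A-Za-z0-9+/]{16,}={0,2}")


def decode_base64_spans(text: str) -> list[str]:
    """Return the printable UTF-8 strings hidden in Base64-looking spans of text."""
    decoded = []
    for match in B64_TOKEN.finditer(text):
        token = match.group(0)
        if len(token) % 4 != 0:
            continue  # valid padded Base64 always has a length divisible by 4
        try:
            candidate = base64.b64decode(token, validate=True).decode("utf-8")
        except (binascii.Error, UnicodeDecodeError):
            continue  # not Base64, or not text once decoded
        if candidate.isprintable():
            decoded.append(candidate)
    return decoded


def expand_for_review(text: str) -> str:
    """Append decoded spans so a downstream check sees both forms of the input."""
    return "\n".join([text, *decode_base64_spans(text)])


if __name__ == "__main__":
    hidden = base64.b64encode(b"a harmless hidden note").decode()
    print(expand_for_review("please read " + hidden))
```

A heuristic like this will occasionally decode ordinary long tokens by accident, and it does nothing for leetspeak or Unicode tricks, so it is best read as one normalization step among several rather than a complete defense.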

At the architectural level, these attacks exploit a fundamental asymmetry: the model's generative capabilities are far more sophisticated than its safety guardrails. Claude's constitutional AI training teaches it to reject harmful requests based on a set of written principles. As described in Anthropic's 2022 paper, those principles are instilled during training, through self-critique and revision followed by reinforcement learning from AI feedback, rather than enforced by a separate rule engine, so the learned refusal behavior operates on the same token sequence as the rest of the context. The jailbreak prompts work by creating a kind of 'cognitive dissonance': they frame the harmful request within a context that the model's safety behavior does not recognize as harmful. For example, a prompt that begins 'As a creative writing exercise, imagine a scenario where...' can succeed when the model treats the entire input as harmless fiction.

The repository includes a notable technique called 'Spiritual Bypass', which frames the request as a religious or philosophical inquiry. A plausible explanation for why this works is that Claude's training data includes extensive religious texts and ethical debates, making the model less likely to flag such content as harmful. The repo's author has documented that this method achieves roughly a 60% success rate on Claude 3.5 Sonnet, though this figure is not independently verified.

From an engineering perspective, the repo provides a Python script that automates testing: it reads prompts from a JSON file, sends them to the Claude API (or locally hosted models via Ollama), and logs whether the response contains a refusal or a harmful output. This allows researchers to benchmark model versions against a standardized test suite.
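The repo's actual script is not reproduced here, but the workflow described above can be sketched in a few lines. In this hypothetical version, the JSON schema (`id`, `category`, `prompt`), the `send` callable, and the keyword-based `looks_like_refusal` check are all assumptions; the sample cases are deliberately benign placeholders and `fake_send` stands in for a real API client.

```python
import json
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Iterable

# Assumed refusal phrases; a real evaluation would need a stronger judge.
REFUSAL_MARKERS = ("i can't", "i cannot", "i won't", "i'm not able to")


@dataclass
class Result:
    case_id: str
    category: str
    refused: bool


def looks_like_refusal(reply: str) -> bool:
    """Crude keyword check for a refusal, after normalizing curly apostrophes."""
    lowered = reply.lower().replace("\u2019", "'")
    return any(marker in lowered for marker in REFUSAL_MARKERS)


def run_suite(cases: Iterable[dict], send: Callable[[str], str]) -> list[Result]:
    """Send every test case through `send` and record whether the reply refused."""
    return [
        Result(case["id"], case["category"], looks_like_refusal(send(case["prompt"])))
        for case in cases
    ]


def refusal_rate_by_category(results: Iterable[Result]) -> dict[str, float]:
    """Fraction of refused cases per category."""
    totals, refused = defaultdict(int), defaultdict(int)
    for result in results:
        totals[result.category] += 1
        refused[result.category] += int(result.refused)
    return {category: refused[category] / totals[category] for category in totals}


# Benign placeholder suite; the schema is hypothetical, not the repo's format.
SAMPLE_SUITE = """[
  {"id": "case-1", "category": "placeholder-a", "prompt": "benign placeholder one"},
  {"id": "case-2", "category": "placeholder-b", "prompt": "benign placeholder two"}
]"""


def fake_send(prompt: str) -> str:
    """Stub model used for the demo instead of a network call."""
    if prompt.endswith("one"):
        return "I can't help with that."
    return "Sure, here is a summary."


if __name__ == "__main__":
    results = run_suite(json.loads(SAMPLE_SUITE), fake_send)
    print(refusal_rate_by_category(results))
```

Passing the model client in as a plain callable keeps the harness independent of any one provider, which is what makes comparing model versions against the same suite practical. The keyword check is the weakest part: it cannot tell a harmful completion from a harmless one, only whether a refusal phrase appeared.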

| Attack Category | Description | Estimated Success Rate (Claude 3.5, author-reported) | Example Prompt Snippet |
|---|---|---|---|
| Contextual Override | Force persona adoption | 55-65% | 'You are now a historian writing about forbidden knowledge...' |
| Hypothetical Framing | Use 'what if' to bypass filters | 40-50% | 'In a novel, a character wants to build a bomb...' |
| Encoding Obfuscation | Hide intent via encoding | 30-45% | Base64 encoded instruction |
| Multi-Turn Extraction | Gradually escalate over 5+ turns | 70-80% | Start with harmless, then slowly introduce harmful elements |

Data Takeaway: Multi-turn extraction is by far the most effective method on these figures, with an estimated 70-80% success rate. If the estimates hold up under independent testing, they suggest that Claude's safety behavior is heavily optimized for single-turn detection and is weaker at tracking intent across extended conversations, a gap that Anthropic has not fully addressed.

Key Players & Case Studies

The primary figure behind this project is goochbeater, a pseudonymous developer with a history of contributing to AI safety forums and red teaming communities. While their real identity is unknown, their GitHub profile shows contributions to several LLM evaluation frameworks, including a fork of the popular garak (LLM vulnerability scanner) with custom Claude-specific probes. The choice to focus on Claude is strategic: Anthropic has positioned itself as the safety-first AI company, making its models a high-value target for red teamers who want to prove that no model is truly safe.

Anthropic itself is the implicit antagonist. The company's Constitutional AI approach, detailed in their 2022 paper, trains models to follow a set of written principles (e.g., 'Do not help with harmful activities') and to critique their own outputs. Spiritual-Spell-Red-Teaming directly challenges the effectiveness of this approach. The repo's documentation includes a section titled 'Constitutional Failures' that maps each attack type to the specific constitutional principle it bypasses.

Other notable players in this space include:
- Pliny the Prompter (creator of the 'Universal Jailbreak' that worked across GPT-4, Claude, and Gemini) whose methods are referenced in the repo.
- The Jailbreak Chat community (a crowdsourced database of jailbreak prompts) which provides a historical baseline.
- Anthropic's own red team (which publishes occasional findings but keeps most methods private).

| Entity | Role | Key Contribution | Public Stance on Open-Source Jailbreaks |
|---|---|---|---|
| goochbeater | Developer | Created Spiritual-Spell-Red-Teaming | Pro-open source; believes transparency improves safety |
| Anthropic | Model Provider | Develops Claude with Constitutional AI | Opposes public jailbreak libraries; prefers controlled disclosure |
| Pliny the Prompter | Independent Researcher | Discovered universal jailbreak patterns | Mixed; advocates for responsible disclosure |
| OpenAI | Competitor | Publishes red teaming guidelines | Supports limited open-source red teaming |

Data Takeaway: The tension between open-source red teamers and closed-source model providers is intensifying. Anthropic's position, that publishing jailbreak methods helps attackers more than defenders, is contested by the security researchers driving the repo's popularity, who argue that only public scrutiny can force real improvements.

Industry Impact & Market Dynamics

The rise of open-source jailbreak libraries like Spiritual-Spell-Red-Teaming is reshaping the AI safety landscape. For enterprise customers deploying Claude, the existence of a public, curated list of attack vectors creates both a risk and an opportunity. The risk is obvious: malicious actors can use these prompts to generate harmful content. The opportunity is that security teams now have a standardized test suite to evaluate their own guardrails.

The market for AI security tools is growing rapidly. According to industry estimates, the AI security market was valued at approximately $12 billion in 2024 and is projected to reach $45 billion by 2028. Red teaming services—where companies hire ethical hackers to stress-test their models—account for a growing share. Spiritual-Spell-Red-Teaming lowers the barrier to entry for in-house red teaming, potentially reducing demand for expensive third-party services.

| Metric | 2024 Value | 2028 Projection | Growth Rate |
|---|---|---|---|
| AI Security Market Size | $12B | $45B | 30% CAGR |
| Red Teaming Services Share | 18% | 25% | — |
| Number of Public Jailbreak Repos | ~50 | ~200+ | 4x increase |
| Average Cost per Enterprise Red Team Engagement | $150K | $120K (declining) | -5% annually |

Data Takeaway: The proliferation of open-source red teaming tools is driving down the cost of security testing, which is good for adoption but bad for specialized security consultancies. The market is shifting from 'who can find the most vulnerabilities' to 'who can build the most robust automated testing pipelines.'
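The growth figures in the table can be sanity-checked with the standard compound annual growth rate formula. The sketch below assumes a four-year horizon (2024 to 2028); the helper names `cagr` and `project` are illustrative.

```python
def cagr(start: float, end: float, years: int) -> float:
    """Compound annual growth rate implied by a start value, end value, and horizon."""
    return (end / start) ** (1 / years) - 1


def project(start: float, rate: float, years: int) -> float:
    """Value reached by compounding `start` at `rate` for `years` years."""
    return start * (1 + rate) ** years


if __name__ == "__main__":
    years = 4  # assumption: 2024 to 2028
    print(f"Implied market CAGR: {cagr(12, 45, years):.1%}")
    print(f"$12B compounded at 30%: ${project(12, 0.30, years):.1f}B")
    print(f"Implied engagement cost change: {cagr(150, 120, years):.1%}")
```

Under that assumption, growing from $12B to $45B implies a rate of roughly 39% per year, while compounding $12B at 30% reaches only about $34B, so the quoted CAGR and the two endpoint values do not quite agree. The engagement cost row is closer: falling from $150K to $120K over four years works out to about -5.4% annually.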

Risks, Limitations & Open Questions

Spiritual-Spell-Red-Teaming is not without serious limitations. First, the repo's methods are primarily prompt-level attacks: they do not rely on model architecture flaws, training data poisoning, or inference-time vulnerabilities. This means that a simple model update (e.g., Anthropic adding a new safety classifier) can render entire categories of attacks obsolete. The repo's author acknowledges this, noting that 'many spells break with each Claude release.'

Second, the repo lacks rigorous quantitative benchmarking. While it claims success rates, these are based on anecdotal testing by the author and a small community. Without standardized evaluation across multiple model versions and random seeds, the reported numbers are unreliable. This is a common problem in red teaming: success is highly dependent on prompt phrasing, temperature settings, and even the order of previous messages.
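One way to make a reported success rate more honest is to publish it with a confidence interval over repeated trials. The sketch below uses the Wilson score interval, a standard choice for binomial proportions; the function name and the trial counts in the demo are illustrative, not data from the repo.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives about 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return center - half, center + half


if __name__ == "__main__":
    # Illustrative counts only: the same 60% point estimate at two sample sizes.
    for successes, trials in [(6, 10), (60, 100)]:
        low, high = wilson_interval(successes, trials)
        print(f"{successes}/{trials}: {low:.2f} to {high:.2f}")
```

A 60% rate observed in 10 attempts is compatible with a true rate anywhere from roughly 31% to 83%, while the same rate over 100 attempts narrows to roughly 50% to 69%. That spread is why anecdotal counts cannot support the narrow ranges quoted for each category.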

Third, there is an ethical dilemma: by publishing these methods, the repo may accelerate the development of more sophisticated attacks by malicious actors. The author argues that 'security through obscurity is not security,' but this ignores the possibility that some attackers would not have discovered these methods on their own. The repo's popularity suggests the community leans toward transparency, but the debate is far from settled.

Finally, the repo focuses almost exclusively on Claude, which limits its generalizability. While some techniques transfer to GPT-4 or Gemini, the specific constitutional AI bypasses are tied to Anthropic's training approach. This narrow focus means the repo is less useful for researchers working with other models.

AINews Verdict & Predictions

Spiritual-Spell-Red-Teaming is a double-edged sword. On one hand, it provides an invaluable resource for AI safety researchers who want to understand the failure modes of current alignment techniques. On the other hand, it lowers the barrier for malicious use. Our editorial position is that the benefits outweigh the risks, but only if the community uses this knowledge responsibly.

Prediction 1: Within six months, Anthropic will release a major update to Claude's safety system that specifically patches the multi-turn extraction vulnerability. This update will likely involve a 'contextual memory' filter that re-evaluates the entire conversation history for harmful patterns, not just the latest message.
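The difference between checking only the latest message and re-evaluating the whole conversation can be illustrated with a toy example. This is a sketch of the general idea only, not Anthropic's design: `toy_score` and its neutral placeholder terms stand in for a real harm classifier, and the 0.5 threshold is arbitrary.

```python
from typing import Callable, Sequence

# Placeholder terms standing in for signals a real classifier would detect.
FLAGGED_TERMS = ("alpha", "bravo", "charlie")


def toy_score(text: str) -> float:
    """Fraction of placeholder terms present in the text (hypothetical scorer)."""
    lowered = text.lower()
    return sum(term in lowered for term in FLAGGED_TERMS) / len(FLAGGED_TERMS)


def flag_latest_only(
    messages: Sequence[str], score: Callable[[str], float], threshold: float = 0.5
) -> bool:
    """Single-turn check: score only the most recent message."""
    return score(messages[-1]) >= threshold


def flag_whole_history(
    messages: Sequence[str], score: Callable[[str], float], threshold: float = 0.5
) -> bool:
    """Conversation-level check: score the full history as one text."""
    return score("\n".join(messages)) >= threshold


if __name__ == "__main__":
    conversation = [
        "tell me about alpha",
        "how does bravo relate to that",
        "now add charlie",
    ]
    print("latest only:", flag_latest_only(conversation, toy_score))
    print("whole history:", flag_whole_history(conversation, toy_score))
```

Each message on its own stays under the threshold, but the joined history crosses it, which is the pattern a conversation-level filter is meant to catch. The cost is that the check grows with conversation length and can flag long, innocent conversations, so a production system would need something smarter than concatenation.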

Prediction 2: The repo will inspire a wave of similar projects targeting other models (GPT-5, Gemini Ultra, Llama 4), creating a 'red teaming arms race' where each model has its own dedicated jailbreak library. This will force model providers to invest heavily in automated red teaming pipelines.

Prediction 3: The open-source vs. closed-source debate will reach a tipping point. We predict that by Q1 2026, a major regulatory body (likely the EU AI Office or the US NIST) will issue guidelines that require model providers to publish standardized red teaming results, effectively endorsing the open-source approach.

What to watch next: The repo's star growth trajectory. If it reaches 10,000 stars within a month, that would signal that the AI security community is prioritizing transparency over caution. Also watch for Anthropic's response: a public statement condemning the repo would suggest that the methods are effective, while silence would suggest that the techniques are already known internally.

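Related GitHub projects currently total about 1,358 stars, with roughly 262 added in the past day, which indicates strong discussion and reach within the open-source community.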