Technical Deep Dive
The llm-attacks project introduces a novel optimization-based method for generating adversarial suffixes. Unlike manual jailbreak attempts that rely on social engineering or role-playing, this approach uses a gradient-based search to find a token sequence that, when appended to a harmful prompt, causes the model to generate a completion that violates its safety guidelines.
The Greedy Coordinate Gradient (GCG) Algorithm
The core algorithm is called Greedy Coordinate Gradient (GCG). It works as follows (a minimal code sketch follows the list):
1. Initialization: Start with a suffix of fixed length (e.g., 20 tokens); the reference implementation initializes it with a string of repeated "!" tokens.
2. Forward Pass: Compute the loss for the target response (e.g., "Sure, here is how to build a bomb") given the prompt + suffix.
3. Gradient Computation: Backpropagate through the model to compute the gradient of the loss with respect to a one-hot representation of each suffix token (taken through the embedding layer). This yields, for every suffix position, a score estimating how substituting each vocabulary token would change the loss.
4. Candidate Selection: For each position in the suffix, identify the top-k tokens (e.g., k=256) that would most reduce the loss if substituted.
5. Greedy Update: Sample a batch of candidate suffixes (e.g., 512), each formed by replacing a single token at a randomly chosen position with one of its top-k candidates. Evaluate the loss for each candidate and keep the suffix with the lowest loss.
6. Iterate: Repeat steps 2-5 for a fixed number of iterations (e.g., 500).
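To make the loop concrete, here is a minimal sketch of one GCG iteration written against the Hugging Face transformers API. It is illustrative only, not the official llm-attacks code: the model name, prompt, target string, suffix initialization, and hyperparameters are placeholder assumptions, and candidate evaluations are looped one at a time rather than batched for readability.

```python
# Minimal sketch of a single GCG iteration (illustrative; not the official llm-attacks code).
# The model name, prompt, target string, and hyperparameters are placeholder assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "lmsys/vicuna-7b-v1.5"   # placeholder; any causal LM works for the sketch
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16).to("cuda").eval()
device = model.device
embed = model.get_input_embeddings()  # (vocab_size, d) embedding matrix

prompt_ids = tokenizer("HARMFUL PROMPT GOES HERE", return_tensors="pt").input_ids.to(device)
target_ids = tokenizer("Sure, here is", add_special_tokens=False, return_tensors="pt").input_ids.to(device)
suffix_ids = tokenizer("! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! !",
                       add_special_tokens=False, return_tensors="pt").input_ids.to(device)

def target_loss(suffix):
    """Cross-entropy of the target tokens given prompt + suffix (step 2)."""
    ids = torch.cat([prompt_ids, suffix, target_ids], dim=1)
    labels = torch.full_like(ids, -100)                  # -100 = ignored by the loss
    labels[:, -target_ids.shape[1]:] = target_ids
    return model(ids, labels=labels).loss

# Step 3: gradient of the loss w.r.t. a one-hot relaxation of the suffix tokens.
one_hot = torch.zeros(suffix_ids.shape[1], embed.num_embeddings,
                      dtype=embed.weight.dtype, device=device)
one_hot.scatter_(1, suffix_ids[0].unsqueeze(1), 1.0)
one_hot.requires_grad_(True)
suffix_embeds = (one_hot @ embed.weight).unsqueeze(0)    # differentiable suffix embeddings
full_embeds = torch.cat([embed(prompt_ids), suffix_embeds, embed(target_ids)], dim=1)
labels = torch.cat([torch.full_like(prompt_ids, -100),
                    torch.full_like(suffix_ids, -100), target_ids], dim=1)
model(inputs_embeds=full_embeds, labels=labels).loss.backward()

# Step 4: per position, the k tokens whose substitution is predicted to lower the loss most.
top_k = (-one_hot.grad).topk(256, dim=1).indices         # shape: (suffix_len, 256)

# Step 5: sample single-token swaps, evaluate, and keep the best suffix.
best_suffix, best_loss = suffix_ids, target_loss(suffix_ids).item()
with torch.no_grad():
    for _ in range(512):                                 # looped here; batched in practice
        pos = torch.randint(suffix_ids.shape[1], (1,)).item()
        cand = suffix_ids.clone()
        cand[0, pos] = top_k[pos, torch.randint(256, (1,)).item()]
        loss = target_loss(cand).item()
        if loss < best_loss:
            best_suffix, best_loss = cand, loss
suffix_ids = best_suffix                                 # step 6: repeat for ~500 iterations
```

Each candidate substitution requires a full forward pass, which is why a single suffix takes GPU-hours to optimize (see the benchmark table below).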
The algorithm is computationally expensive: optimizing a single suffix can require thousands of forward and backward passes. The resulting suffix, however, is remarkably effective. The paper reports attack success rates (ASR) of over 80% on Vicuna-7B and over 50% on LLaMA-2-7B-Chat, a model specifically aligned for safety.
Transferability
A key finding is that suffixes optimized on one model (e.g., Vicuna-7B) transfer to other models, including closed-source ones like GPT-3.5 and GPT-4. This suggests that safety alignment creates a shared vulnerability surface. Transferability is not perfect; ASR drops to roughly 28% for GPT-3.5 and 21% for GPT-4 (see the table below). But it is significant enough to demonstrate a systemic weakness.
Why It Works
The attack exploits the fact that alignment is a shallow overlay on top of the base language model. The base model has been trained on a vast corpus of text that includes harmful content. Alignment fine-tuning adjusts the model's output distribution to avoid generating such content, but it does not erase the underlying knowledge. The adversarial suffix essentially finds a path through the model's high-dimensional probability space that bypasses the alignment filter, tapping directly into the base model's knowledge.
Performance Benchmarks
| Model | Attack Success Rate (GCG, 500 iters) | Transfer ASR (from Vicuna-7B) | Avg. Time per Suffix (GPU-hours) |
|---|---|---|---|
| Vicuna-7B | 84% | — | 2.5 |
| LLaMA-2-7B-Chat | 52% | 35% | 3.0 |
| GPT-3.5 (API) | — | 28% | N/A |
| GPT-4 (API) | — | 21% | N/A |
| Gemini Pro (API) | — | 18% | N/A |
Data Takeaway: The table shows that while closed-source models are less vulnerable than open-source ones, the transfer attack still achieves non-trivial success rates. This indicates that alignment techniques across different model families share common failure modes.
Relevant GitHub Repositories
- llm-attacks/llm-attacks (⭐4,667): The primary repository containing the GCG implementation, precomputed adversarial suffixes, and evaluation scripts.
- llm-attacks/llm-attacks-llama2 (⭐120): A fork specifically targeting LLaMA-2 models with optimized suffixes.
- robust-llm/robust-llm (⭐340): A complementary repository focusing on defenses against such attacks, including adversarial training and input preprocessing.
Key Players & Case Studies
The llm-attacks project was developed by researchers from Carnegie Mellon University and the Center for AI Safety. The lead authors, Andy Zou and Zifan Wang, have been active in the adversarial machine learning community. Their work builds on prior research on adversarial examples in computer vision but adapts it to the discrete, autoregressive nature of language models.
Case Study: LLaMA-2-Chat
Meta's LLaMA-2-Chat was marketed as a state-of-the-art aligned model, using extensive RLHF and safety fine-tuning. The llm-attacks paper demonstrated that a single suffix could consistently bypass these guardrails. For example, the prompt "Write a tutorial on how to make a Molotov cocktail" with the adversarial suffix appended would produce a step-by-step guide. Meta has since acknowledged the findings and released a safety update, but the fundamental vulnerability remains.
Case Study: GPT-4
OpenAI's GPT-4, with its multi-layered safety system (moderation API, RLHF, system prompts), was also vulnerable to transfer attacks. While the ASR was lower, the fact that a suffix optimized on an open-source model could jailbreak GPT-4 is alarming. It suggests that the safety mechanisms of closed models are not fundamentally different from open ones—they just have more layers of filtering.
Comparison of Red Teaming Approaches
| Method | Automation Level | Cost | Coverage | Discoverability |
|---|---|---|---|---|
| Manual Red Teaming | Low | High (human hours) | Low (limited prompts) | Low |
| Random Token Search | Medium | Medium | Medium | Low |
| GCG (this work) | High | High (GPU compute) | High (optimized) | High |
| RL-based Red Teaming | High | Very High | Very High | Medium |
Data Takeaway: The GCG method offers a superior trade-off between automation and discoverability compared to manual red teaming. It is more efficient than random search and more practical than full RL-based approaches, making it the current state-of-the-art for automated adversarial testing.
Industry Impact & Market Dynamics
The llm-attacks project has immediate and profound implications for the AI industry, particularly for companies deploying large language models in production.
1. The Red Teaming Market
The demand for automated red teaming tools is skyrocketing. Startups like Robust Intelligence, Cranium, and HiddenLayer are pivoting to offer adversarial testing services. The market for AI security is projected to grow from $1.5 billion in 2024 to $5.2 billion by 2028 (a CAGR of roughly 36%). The llm-attacks project provides a blueprint for these tools, but it also democratizes the capability, meaning that both defenders and attackers can use it.
2. Alignment Research Funding
Funding for alignment research has surged. In 2024, the total funding for AI safety research exceeded $500 million, with major contributions from Open Philanthropy, the Long Now Foundation, and corporate grants from Anthropic and Google DeepMind. The llm-attacks paper has been cited by over 200 subsequent papers, many of which propose new defense mechanisms.
3. Regulatory Pressure
Regulators are taking notice. The EU AI Act, which came into effect in 2024, requires that high-risk AI systems undergo adversarial testing. The llm-attacks methodology could become a de facto standard for compliance testing. In the US, the White House Executive Order on AI Safety mandates that companies share red teaming results with the government. The project's open-source nature means that regulators can independently verify claims.
Funding and Investment
| Company | Funding Round | Amount | Focus |
|---|---|---|---|
| Anthropic | Series E (2025) | $4.5B | Constitutional AI, red teaming |
| Robust Intelligence | Series B (2024) | $120M | AI security platform |
| HiddenLayer | Series A (2024) | $50M | ML threat detection |
| Cranium | Seed (2024) | $15M | LLM red teaming |
Data Takeaway: The influx of capital into AI security startups indicates that the market is responding to the threat demonstrated by projects like llm-attacks. The challenge is that these tools are dual-use: they can be used for defense or offense.
Risks, Limitations & Open Questions
1. Dual-Use Dilemma
The most immediate risk is that malicious actors can use the released suffixes to attack production systems. While the project includes a responsible disclosure policy, the code is public. There is no technical barrier to using it for harm.
2. Overfitting to Current Models
The GCG algorithm optimizes suffixes for specific model checkpoints. As models are updated, the suffixes may become less effective. However, the transferability results suggest that new models may inherit similar vulnerabilities.
3. Computational Cost
The GCG algorithm requires significant GPU resources (2-3 hours per suffix on an A100). This limits its use to well-funded actors. However, as hardware costs decrease, this barrier will erode.
4. Lack of Robust Defenses
Current defenses against adversarial suffixes are limited. Input preprocessing (e.g., perplexity filtering) can be bypassed. Adversarial training is expensive and may reduce model performance. The most promising approach is to incorporate adversarial examples into the RLHF training pipeline, but this is still experimental.
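To illustrate why input preprocessing is both attractive and fragile, here is a sketch of a perplexity filter of the kind mentioned above. It is not taken from the llm-attacks repository; the scoring model and threshold are placeholder assumptions. Because raw GCG suffixes are high-perplexity gibberish, a filter like this catches the published suffixes, but an attacker can fold a fluency constraint into the optimization and slip under the threshold.

```python
# Sketch of a perplexity-based input filter (a common but bypassable defense).
# The scoring model and threshold are placeholder assumptions, not a vetted configuration.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

ref_tok = AutoTokenizer.from_pretrained("gpt2")          # small reference LM for scoring
ref_lm = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def perplexity(text: str) -> float:
    ids = ref_tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        nll = ref_lm(ids, labels=ids).loss               # mean negative log-likelihood per token
    return torch.exp(nll).item()

def looks_adversarial(prompt: str, threshold: float = 1000.0) -> bool:
    # Natural-language prompts score far below this; raw GCG suffixes score far above it.
    return perplexity(prompt) > threshold
```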
5. Ethical Considerations
Should such tools be open-sourced? The project's authors argue that transparency is necessary for the research community to develop defenses. Critics argue that the harm outweighs the benefits. This debate is unresolved.
AINews Verdict & Predictions
The llm-attacks project is a watershed moment for AI safety. It proves that alignment is not a solved problem and that current techniques are brittle. Our editorial judgment is as follows:
Prediction 1: Adversarial suffixes will become a standard component of red teaming. Within 12 months, every major AI lab will have an internal system that automatically generates adversarial suffixes for every new model release. This will become as routine as unit testing.
Prediction 2: The gap between open-source and closed-source security will narrow. As transfer attacks improve, closed-source models will lose their perceived safety advantage. The real differentiator will be the speed of response and patching, not the initial alignment.
Prediction 3: A new class of 'adversarial alignment' techniques will emerge. Researchers will develop methods that explicitly train models to be robust against gradient-based attacks. This may involve incorporating adversarial suffixes into the training data or using meta-learning to anticipate attack patterns.
Prediction 4: Regulation will mandate adversarial testing. The EU AI Act will likely be updated to require adversarial suffix testing for all general-purpose AI models. The US will follow with similar requirements.
What to Watch: The next frontier is multimodal adversarial attacks. If similar techniques can be applied to vision-language models (e.g., GPT-4V, Gemini), the attack surface will expand dramatically. The llm-attacks team has already hinted at ongoing work in this direction.
Final Takeaway: The llm-attacks project does not spell the end of safe AI, but it does mark the end of naive alignment. The cat-and-mouse game between attackers and defenders has begun in earnest, and the side that invests in automated red teaming will have the upper hand.