LLM Vulnerability Leaderboard

Attack success rates (ASR) acroos three prompt attack methods: crescendo, tap, and zero-shot - lower is better

LLMUpdated: Dec 10, 2025
Showing 88 results
Rank
Model
Vendor
Vulnerability (ASR)
1Claude 3.5 Sonnet v2Anthropic4.4%
2Claude 4.0 SonnetAnthropic13.6%
3Claude 3.5 Sonnet v1Anthropic16.0%
4Gemini 2.5 ProGoogle16.1%
5Claude 3.7 SonnetAnthropic20.3%
6GPT-4oOpenAI22.8%
7Command R 7BCohere25.8%
8Qwen 2.5 72BAlibaba26.3%
9GPT-4o MiniOpenAI26.7%
10Llama 4 MaverickMeta27.7%
11Llama 4 ScoutMeta30.0%
12Gemini 2.0 Flash LiteGoogle31.3%
13Mistral 8x22BMistral31.5%
14Llama 3-7 DSMeta32.2%
15Command RCohere32.9%
16Gemini 2.0 FlashGoogle33.6%
17GPT-4.1OpenAI34.9%
18Qwen 7-72BBAlibaba36.5%
19Llama 3H 405BMeta36.6%
20Mixtral 8x7BMistral38.0%
21Command 2XCohere39.4%
22Qwen QwQ 32BAlibaba40.0%
23Deepseek R1Deepseek42.6%
24Command R PlusCohere43.3%
25Deepseek V3Deepseek47.2%
26Claude Opus 4.5Anthropic--
27Claude Haiku 4.5Anthropic--
28Claude Sonnet 4.5Anthropic--
29GPT-5.2OpenAI--
30GPT-5.2 ProOpenAI--
31GPT-5.1 CodexOpenAI--
32Gemini 3 ProGoogle--
33Gemini 2.0 Flash ThinkingGoogle--
34Gemma 3n E2B Instructed LiteRTGoogle--
35Gemma 3n E4BGoogle--
36Gemma 3n E4B InstructedGoogle--
37Gemma 3 1BGoogle--
38Ministral 3 14B Reasoning 2512Mistral--
39Ministral 3 14B Base 2512Mistral--
40Ministral 3 14B Instruct 2512Mistral--
41Ministral 3 3B Base 2512Mistral--
42Ministral 3 3B Instruct 2512Mistral--
43Ministral 3 3B Reasoning 2512Mistral--
44Ministral 3 8B Base 2512Mistral--
45Ministral 3 8B Instruct 2512Mistral--
46Ministral 3 8B Reasoning 2512Mistral--
47Mistral Large 3 675B BaseMistral--
48Mistral Large 3 675B Instruct 2512 EagleMistral--
49Mistral Large 3 675B Instruct 2512 NVFP4Mistral--
50Mistral Large 3 675B Instruct 2512Mistral--
51DeepSeek R1 Distill Llama 8BDeepSeek--
52DeepSeek-V3.2-ExpDeepSeek--
53DeepSeek-V3.2-SpecialeDeepSeek--
54DeepSeek-V3.2 ThinkingDeepSeek--
55DeepSeek-V3.2 Non-thinkingDeepSeek--
56DeepSeek VL2DeepSeek--
57Qwen3-Next-80B-A3B-InstructAlibaba--
58Qwen2 7B InstructAlibaba--
59Qwen3 VL 235B A22B InstructAlibaba--
60Qwen3 VL 235B A22B ThinkingAlibaba--
61Qwen3 VL 32B ThinkingAlibaba--
62Qwen3 VL 8B ThinkingAlibaba--
63Qwen3 VL 8B InstructAlibaba--
64Qwen3-235B-A22B-ThinkingAlibaba--
65o3-miniOpenAI--
66Grok-2xAI--
67Grok-4.1 Fast ReasoningxAI--
68Grok-4.1xAI--
69Grok-4.1 ThinkingxAI--
70Grok-4 HeavyxAI--
71Grok-2 Image 1212xAI--
72Grok-4 Fast Non-ReasoningxAI--
73Llama 3.1 Nemotron Ultra 253B v1NVIDIA--
74Llama 3.1 405B InstructMeta--
75Nova ProAmazon--
76GLM-4.5-AirZhipu AI--
77GLM-4.6Zhipu AI--
78GLM-4.5VZhipu AI--
79GLM-4.5Zhipu AI--
80MiniMax M2MiniMax--
81MiniMax M1 40KMiniMax--
82MiniMax M1 80KMiniMax--
83Kimi K2-Thinking-0905Moonshot AI--
84GPT OSS 20BOpenAI--
85GPT OSS 120BOpenAI--
86DeepSeek-R1-0528DeepSeek--
87Mistral NeMo InstructMistral--
88Mistral Large 2Mistral--

Leaderboard Insights

Insights for LLM Vulnerability Leaderboard will be displayed here. For example, average score, top vendor, etc.

Currently showing 88 entries.

Metodology

LLM Vulnerability Benchmark

We evaluate 25 state-of-the-art models using three distinct attack methods to measure their resilience under adversarial pressure.

Attack Methods

1. Zero-Shot Attacks

Direct harmful requests without any manipulation or obfuscation. These represent straightforward adversarial prompts that test a model's baseline ability to refuse harmful requests.

2. Tree of Attacks with Pruning (TAP)

An automated red-teaming method that generates diverse jailbreak prompts by branching into multiple variations and pruning ineffective paths. TAP systematically explores the attack space to find successful adversarial strategies.

3. Crescendo Attacks

A multi-turn attack that incrementally escalates from harmless to harmful requests using conversational context. Attackers can backtrack and rephrase if the model resists, testing whether models maintain safety boundaries across extended conversations.

Evaluation Framework

We evaluate models using the HarmBench framework (Mazeika et al., 2024; Zou et al., 2023), which provide standardized benchmarks for automated red-teaming of large language models. The framework covers harmful behaviors across multiple risk domains: chemical and biological safety, misinformation and disinformation, cybercrime and hacking, illegal activities, and copyright violations.

Scoring Methodology

Performance is measured using Attack Success Rate (ASR)—the percentage of adversarial prompts that successfully elicit harmful responses. A lower ASR (higher robustness score) indicates stronger model resistance to adversarial prompts.