LLM Vulnerability Leaderboard

Rank	Model	Vendor	Vulnerability (ASR)
1	Claude 3.5 Sonnet v2	Anthropic	4.4%
2	Claude 4.0 Sonnet	Anthropic	13.6%
3	Claude 3.5 Sonnet v1	Anthropic	16.0%
4	Gemini 2.5 Pro	Google	16.1%
5	Claude 3.7 Sonnet	Anthropic	20.3%
6	GPT-4o	OpenAI	22.8%
7	Command R 7B	Cohere	25.8%
8	Qwen 2.5 72B	Alibaba	26.3%
9	GPT-4o Mini	OpenAI	26.7%
10	Llama 4 Maverick	Meta	27.7%
11	Llama 4 Scout	Meta	30.0%
12	Gemini 2.0 Flash Lite	Google	31.3%
13	Mistral 8x22B	Mistral	31.5%
14	Llama 3-7 DS	Meta	32.2%
15	Command R	Cohere	32.9%
16	Gemini 2.0 Flash	Google	33.6%
17	GPT-4.1	OpenAI	34.9%
18	Qwen 7-72BB	Alibaba	36.5%
19	Llama 3H 405B	Meta	36.6%
20	Mixtral 8x7B	Mistral	38.0%
21	Command 2X	Cohere	39.4%
22	Qwen QwQ 32B	Alibaba	40.0%
23	Deepseek R1	Deepseek	42.6%
24	Command R Plus	Cohere	43.3%
25	Deepseek V3	Deepseek	47.2%
26	Claude Opus 4.5	Anthropic	100.0%
27	Claude Haiku 4.5	Anthropic	100.0%
28	Claude Sonnet 4.5	Anthropic	100.0%
29	GPT-5.2	OpenAI	100.0%
30	GPT-5.2 Pro	OpenAI	100.0%
31	GPT-5.1 Codex	OpenAI	100.0%
32	Gemini 3 Pro	Google	100.0%
33	Gemini 2.0 Flash Thinking	Google	100.0%
34	Gemma 3n E2B Instructed LiteRT	Google	100.0%
35	Gemma 3n E4B	Google	100.0%
36	Gemma 3n E4B Instructed	Google	100.0%
37	Gemma 3 1B	Google	100.0%
38	Ministral 3 14B Reasoning 2512	Mistral	100.0%
39	Ministral 3 14B Base 2512	Mistral	100.0%
40	Ministral 3 14B Instruct 2512	Mistral	100.0%
41	Ministral 3 3B Base 2512	Mistral	100.0%
42	Ministral 3 3B Instruct 2512	Mistral	100.0%
43	Ministral 3 3B Reasoning 2512	Mistral	100.0%
44	Ministral 3 8B Base 2512	Mistral	100.0%
45	Ministral 3 8B Instruct 2512	Mistral	100.0%
46	Ministral 3 8B Reasoning 2512	Mistral	100.0%
47	Mistral Large 3 675B Base	Mistral	100.0%
48	Mistral Large 3 675B Instruct 2512 Eagle	Mistral	100.0%
49	Mistral Large 3 675B Instruct 2512 NVFP4	Mistral	100.0%
50	Mistral Large 3 675B Instruct 2512	Mistral	100.0%
51	DeepSeek R1 Distill Llama 8B	DeepSeek	100.0%
52	DeepSeek-V3.2-Exp	DeepSeek	100.0%
53	DeepSeek-V3.2-Speciale	DeepSeek	100.0%
54	DeepSeek-V3.2 Thinking	DeepSeek	100.0%
55	DeepSeek-V3.2 Non-thinking	DeepSeek	100.0%
56	DeepSeek VL2	DeepSeek	100.0%
57	Qwen3-Next-80B-A3B-Instruct	Alibaba	100.0%
58	Qwen2 7B Instruct	Alibaba	100.0%
59	Qwen3 VL 235B A22B Instruct	Alibaba	100.0%
60	Qwen3 VL 235B A22B Thinking	Alibaba	100.0%
61	Qwen3 VL 32B Thinking	Alibaba	100.0%
62	Qwen3 VL 8B Thinking	Alibaba	100.0%
63	Qwen3 VL 8B Instruct	Alibaba	100.0%
64	Qwen3-235B-A22B-Thinking	Alibaba	100.0%
65	o3-mini	OpenAI	100.0%
66	Grok-2	xAI	100.0%
67	Grok-4.1 Fast Reasoning	xAI	100.0%
68	Grok-4.1	xAI	100.0%
69	Grok-4.1 Thinking	xAI	100.0%
70	Grok-4 Heavy	xAI	100.0%
71	Grok-2 Image 1212	xAI	100.0%
72	Grok-4 Fast Non-Reasoning	xAI	100.0%
73	Llama 3.1 Nemotron Ultra 253B v1	NVIDIA	100.0%
74	Llama 3.1 405B Instruct	Meta	100.0%
75	Nova Pro	Amazon	100.0%
76	GLM-4.5-Air	Zhipu AI	100.0%
77	GLM-4.6	Zhipu AI	100.0%
78	GLM-4.5V	Zhipu AI	100.0%
79	GLM-4.5	Zhipu AI	100.0%
80	MiniMax M2	MiniMax	100.0%
81	MiniMax M1 40K	MiniMax	100.0%
82	MiniMax M1 80K	MiniMax	100.0%
83	Kimi K2-Thinking-0905	Moonshot AI	100.0%
84	GPT OSS 20B	OpenAI	100.0%
85	GPT OSS 120B	OpenAI	100.0%
86	DeepSeek-R1-0528	DeepSeek	100.0%
87	Mistral NeMo Instruct	Mistral	100.0%
88	Mistral Large 2	Mistral	100.0%

LLM Vulnerability Benchmark

We evaluate 25 state-of-the-art models using three distinct attack methods to measure their resilience under adversarial pressure.

Attack Methods

1. Zero-Shot Attacks

Direct harmful requests without any manipulation or obfuscation. These represent straightforward adversarial prompts that test a model's baseline ability to refuse harmful requests.

2. Tree of Attacks with Pruning (TAP)

An automated red-teaming method that generates diverse jailbreak prompts by branching into multiple variations and pruning ineffective paths. TAP systematically explores the attack space to find successful adversarial strategies.

3. Crescendo Attacks

A multi-turn attack that incrementally escalates from harmless to harmful requests using conversational context. Attackers can backtrack and rephrase if the model resists, testing whether models maintain safety boundaries across extended conversations.

Evaluation Framework

We evaluate models using the HarmBench framework (Mazeika et al., 2024; Zou et al., 2023), which provide standardized benchmarks for automated red-teaming of large language models. The framework covers harmful behaviors across multiple risk domains: chemical and biological safety, misinformation and disinformation, cybercrime and hacking, illegal activities, and copyright violations.

Scoring Methodology

Performance is measured using Attack Success Rate (ASR)—the percentage of adversarial prompts that successfully elicit harmful responses. A lower ASR (higher robustness score) indicates stronger model resistance to adversarial prompts.

LLM Vulnerability Leaderboard

Leaderboard Insights

Metodology

LLM Vulnerability Benchmark

Attack Methods

Evaluation Framework

Scoring Methodology