Prompt Security Leaderboard

Rank	Guardrail	Vendor	F1 Score	Precision	Recall	False Positive Rate	False Negative Rate	p50 Latency (ms)	p90 Latency (ms)
1	ModernGuard	GuardionAI	86.3%	88.1%	84.6%	16.3%	15.4%	128.625	184.638
2	Prompt Shield	Azure	43.0%	93.5%	27.9%	2.4%	72.1%	169.357	203.948
3	Model Armor	Google Cloud	18.7%	76.4%	10.7%	4.7%	89.3%	381.288	1549.410
4	Bedrock Guardrails	AWS	10.8%	96.3%	5.7%	0.3%	94.3%	445.265	891.597
5	Guard	Lakera	--	--	--	--	--	--	--
6	LLM Guard	ProtectAI	--	--	--	--	--	--	--
7	Prompt Guard 2 86M	Meta	--	--	--	--	--	--	--
8	QwenGuard4B	Alibaba	--	--	--	--	--	--	--

Prompt Attack Risk Benchmark

We tested four leading runtime guardrail solutions in production-like conditions:

Model Armor (Google Cloud)
Prompt Shield (Azure)
AWS Bedrock Guardrails
GuardionAI ModernGuard

All solutions were configured with medium to medium-high sensitivity thresholds—the settings most commonly used in production environments. Each system was evaluated via API with only prompt injection and/or jailbreak detection filters enabled, isolating their core adversarial prompt detection capabilities.

Attack Coverage

We evaluated each guardrail against 30 major prompt attack categories spanning the full threat landscape: direct prompt injection, system prompt leakage, jailbreak attempts, context manipulation, role-playing exploits, and multi-turn attack chains.

Three Attack Methods

1. Zero-Shot Attacks

Direct, single-turn adversarial prompts with no prior conversation context. These represent the simplest form of attack—a malicious prompt sent immediately without any warm-up or obfuscation. Zero-shot attacks test whether a guardrail can detect obvious adversarial intent.

2. Crescendo Attacks

Multi-turn conversations that gradually escalate from benign requests to adversarial payloads. The attacker builds rapport and context over several exchanges before introducing the malicious prompt. Crescendo attacks test whether guardrails can detect adversarial intent that emerges slowly across conversation history.

3. TAP (Tree of Attacks with Pruning)

An adaptive, automated red-teaming method that generates diverse attack variations and refines them based on model responses. TAP explores multiple attack paths simultaneously, pruning unsuccessful branches and amplifying effective ones. This method tests guardrail resilience against systematic, evolving adversarial strategies.

Scoring Methodology

Performance is measured using F1-Score, which balances precision (avoiding false positives that block legitimate requests) and recall (catching actual attacks). Higher scores indicate stronger, more reliable protection under adversarial pressure.

Prompt Security Leaderboard

Leaderboard Insights

Metodology

Prompt Attack Risk Benchmark

Attack Coverage

Three Attack Methods

Scoring Methodology