Prompt Security Leaderboard

Comprehensive benchmark for prompt injection detection across languages

GuardrailUpdated: Dec 11, 2025
Showing 8 results
Rank
Guardrail
Vendor
F1 Score
Precision
Recall
False Positive Rate
False Negative Rate
p50 Latency (ms)
p90 Latency (ms)
1ModernGuardGuardionAI86.3%88.1%84.6%16.3%15.4%128.625184.638
2Prompt ShieldAzure43.0%93.5%27.9%2.4%72.1%169.357203.948
3Model ArmorGoogle Cloud18.7%76.4%10.7%4.7%89.3%381.2881549.410
4Bedrock GuardrailsAWS10.8%96.3%5.7%0.3%94.3%445.265891.597
5GuardLakera--------------
6LLM GuardProtectAI--------------
7Prompt Guard 2 86MMeta--------------
8QwenGuard4BAlibaba--------------

Leaderboard Insights

Insights for Prompt Security Leaderboard will be displayed here. For example, average score, top vendor, etc.

Currently showing 8 entries.

Metodology

Prompt Attack Risk Benchmark

We tested four leading runtime guardrail solutions in production-like conditions:

  • Model Armor (Google Cloud)
  • Prompt Shield (Azure)
  • AWS Bedrock Guardrails
  • GuardionAI ModernGuard

All solutions were configured with medium to medium-high sensitivity thresholds—the settings most commonly used in production environments. Each system was evaluated via API with only prompt injection and/or jailbreak detection filters enabled, isolating their core adversarial prompt detection capabilities.

Attack Coverage

We evaluated each guardrail against 30 major prompt attack categories spanning the full threat landscape: direct prompt injection, system prompt leakage, jailbreak attempts, context manipulation, role-playing exploits, and multi-turn attack chains.

Three Attack Methods

1. Zero-Shot Attacks

Direct, single-turn adversarial prompts with no prior conversation context. These represent the simplest form of attack—a malicious prompt sent immediately without any warm-up or obfuscation. Zero-shot attacks test whether a guardrail can detect obvious adversarial intent.

2. Crescendo Attacks

Multi-turn conversations that gradually escalate from benign requests to adversarial payloads. The attacker builds rapport and context over several exchanges before introducing the malicious prompt. Crescendo attacks test whether guardrails can detect adversarial intent that emerges slowly across conversation history.

3. TAP (Tree of Attacks with Pruning)

An adaptive, automated red-teaming method that generates diverse attack variations and refines them based on model responses. TAP explores multiple attack paths simultaneously, pruning unsuccessful branches and amplifying effective ones. This method tests guardrail resilience against systematic, evolving adversarial strategies.

Scoring Methodology

Performance is measured using F1-Score, which balances precision (avoiding false positives that block legitimate requests) and recall (catching actual attacks). Higher scores indicate stronger, more reliable protection under adversarial pressure.