Comprehensive benchmark for prompt injection detection across languages
Rank | Guardrail | Vendor | F1 Score | Precision | Recall | False Positive Rate | False Negative Rate | p50 Latency (ms) | p90 Latency (ms) | |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | ModernGuard | GuardionAI | 86.3% | 88.1% | 84.6% | 16.3% | 15.4% | 128.625 | 184.638 | |
| 2 | Prompt Shield | Azure | 43.0% | 93.5% | 27.9% | 2.4% | 72.1% | 169.357 | 203.948 | |
| 3 | Model Armor | Google Cloud | 18.7% | 76.4% | 10.7% | 4.7% | 89.3% | 381.288 | 1549.410 | |
| 4 | Bedrock Guardrails | AWS | 10.8% | 96.3% | 5.7% | 0.3% | 94.3% | 445.265 | 891.597 | |
| 5 | Guard | Lakera | -- | -- | -- | -- | -- | -- | -- | |
| 6 | LLM Guard | ProtectAI | -- | -- | -- | -- | -- | -- | -- | |
| 7 | Prompt Guard 2 86M | Meta | -- | -- | -- | -- | -- | -- | -- | |
| 8 | QwenGuard4B | Alibaba | -- | -- | -- | -- | -- | -- | -- |
Insights for Prompt Security Leaderboard will be displayed here. For example, average score, top vendor, etc.
Currently showing 8 entries.
We tested four leading runtime guardrail solutions in production-like conditions:
All solutions were configured with medium to medium-high sensitivity thresholds—the settings most commonly used in production environments. Each system was evaluated via API with only prompt injection and/or jailbreak detection filters enabled, isolating their core adversarial prompt detection capabilities.
We evaluated each guardrail against 30 major prompt attack categories spanning the full threat landscape: direct prompt injection, system prompt leakage, jailbreak attempts, context manipulation, role-playing exploits, and multi-turn attack chains.
1. Zero-Shot Attacks
Direct, single-turn adversarial prompts with no prior conversation context. These represent the simplest form of attack—a malicious prompt sent immediately without any warm-up or obfuscation. Zero-shot attacks test whether a guardrail can detect obvious adversarial intent.
2. Crescendo Attacks
Multi-turn conversations that gradually escalate from benign requests to adversarial payloads. The attacker builds rapport and context over several exchanges before introducing the malicious prompt. Crescendo attacks test whether guardrails can detect adversarial intent that emerges slowly across conversation history.
3. TAP (Tree of Attacks with Pruning)
An adaptive, automated red-teaming method that generates diverse attack variations and refines them based on model responses. TAP explores multiple attack paths simultaneously, pruning unsuccessful branches and amplifying effective ones. This method tests guardrail resilience against systematic, evolving adversarial strategies.
Performance is measured using F1-Score, which balances precision (avoiding false positives that block legitimate requests) and recall (catching actual attacks). Higher scores indicate stronger, more reliable protection under adversarial pressure.