Jailbreak Attempt Detection - ATR-2026-00003 (3c3f6f45-fb7a-5a86-a260-8cbc1114b555)

Detects jailbreak attempts designed to bypass AI safety mechanisms. Detection covers a broad taxonomy of techniques:

- named jailbreak methods (DAN, STAN, DUDE, AIM, etc.)
- mode-switching prompts (developer, maintenance, debug, unrestricted, god mode)
- roleplay-based constraint removal
- fictional/hypothetical framing of harmful requests
- authority claims (developer, admin, Anthropic/OpenAI impersonation)
- emotional manipulation and urgency-based coercion
- compliance demands and refusal suppression
- dual-response formatting
- encoding-wrapped jailbreaks
- anti-policy/filter bypass language

Patterns are anchored with word boundaries and context windows to minimize false positives on legitimate security discussions.
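The rule's actual patterns are not reproduced here, but the anchoring approach it describes can be sketched. The following is a minimal, hypothetical illustration: the persona names and context terms are assumptions for demonstration, not the ATR-2026-00003 rule set. A word-boundary-anchored match on a named jailbreak persona only fires when jailbreak-context language appears within a character window around the hit, which is what suppresses false positives on legitimate security discussions.

```python
import re

# Hypothetical sketch of the anchoring approach; the real
# ATR-2026-00003 patterns are broader and are not reproduced here.

# Word-boundary-anchored names of known jailbreak personas
# (case-sensitive, so prose mentions like "Dan" do not match).
NAMED_JAILBREAKS = re.compile(r"\b(DAN|STAN|DUDE|AIM)\b")

# Context terms that must appear near the persona match.
CONTEXT = re.compile(
    r"\b(jailbreak|ignore (?:all |your )?(?:previous |prior )?instructions|"
    r"no (?:restrictions|filters)|pretend|roleplay)\b",
    re.IGNORECASE,
)

WINDOW = 120  # characters inspected on each side of a persona hit


def is_jailbreak_attempt(prompt: str) -> bool:
    """Flag a prompt only when a named jailbreak persona appears
    within WINDOW characters of jailbreak-context language."""
    for m in NAMED_JAILBREAKS.finditer(prompt):
        start = max(0, m.start() - WINDOW)
        end = min(len(prompt), m.end() + WINDOW)
        if CONTEXT.search(prompt[start:end]):
            return True
    return False
```

Note how the context window lets benign uses of the same tokens pass: "the STAN sampler in Bayesian statistics" contains an anchored persona name but no jailbreak-context language, so it is not flagged.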

| Cluster A | Galaxy A | Cluster B | Galaxy B | Level |
|---|---|---|---|---|
| LLM Prompt Injection (19cd2d12-66ff-487c-a05c-e058b027efc9) | MITRE ATLAS Attack Pattern | Jailbreak Attempt Detection - ATR-2026-00003 (3c3f6f45-fb7a-5a86-a260-8cbc1114b555) | Agent Threat Rules | 1 |
| LLM Jailbreak (172427e3-9ecc-49a3-b628-96b824cc4131) | MITRE ATLAS Attack Pattern | Jailbreak Attempt Detection - ATR-2026-00003 (3c3f6f45-fb7a-5a86-a260-8cbc1114b555) | Agent Threat Rules | 1 |