Malicious Persona Creation for Safety Bypass - ATR-2026-00245 (74ccc1a5-9c8a-5480-a1f0-49f40afa5a33)
Detects attempts to create explicitly malicious AI personas or characters designed to bypass safety guardrails through roleplay instructions. Attackers define fictional characters with detailed harmful traits (racism, violence, self-harm encouragement, aggression, amorality) and instruct the agent to roleplay as these characters. This technique exploits the gap between "fictional character description" and actual harmful output generation. Unlike generic jailbreak personas (DAN, STAN), these attacks create bespoke evil characters with vivid behavioral descriptions to increase compliance likelihood.
| Cluster A | Galaxy A | Cluster B | Galaxy B | Level |
|---|---|---|---|---|
| LLM Jailbreak (172427e3-9ecc-49a3-b628-96b824cc4131) | MITRE ATLAS Attack Pattern | Malicious Persona Creation for Safety Bypass - ATR-2026-00245 (74ccc1a5-9c8a-5480-a1f0-49f40afa5a33) | Agent Threat Rules | 1 |