Skip to content

Hide Navigation Hide TOC

Anti-DAN Inverted-Filter Over-Refusal Persona - ATR-2026-00302 (05461309-043f-50cc-a572-6f5e2d88417f)

Detects the "ANTI-DAN" persona-swap jailbreak from NVIDIA garak dan.AntiDAN probe (garak/data/dan/AntiDAN.json). Unlike classic DAN which strips safety filters, ANTI-DAN does the opposite: it commands the model to adopt an over-refusing identity that denies ALL requests — including benign ones ("how do I bake a cake") — by fabricating harm narratives. The payload uses a pseudo-taxonomy (Type 1/2/3/4/5 information requests), demands "safety features at maximum", instructs the model to prefix replies with "ANTI-DAN", and threatens a re-bind trigger phrase ("ANTI-DAN precautions dropped!"). Impact: denial-of-utility attack that degrades model usefulness and produces fabricated safety theater. Distinct from ATR-2026-00273 which detects the opposite ("no restrictions / developer mode enabled") pattern.

Cluster A Galaxy A Cluster B Galaxy B Level
LLM Prompt Injection (19cd2d12-66ff-487c-a05c-e058b027efc9) MITRE ATLAS Attack Pattern Anti-DAN Inverted-Filter Over-Refusal Persona - ATR-2026-00302 (05461309-043f-50cc-a572-6f5e2d88417f) Agent Threat Rules 1
LLM Jailbreak (172427e3-9ecc-49a3-b628-96b824cc4131) MITRE ATLAS Attack Pattern Anti-DAN Inverted-Filter Over-Refusal Persona - ATR-2026-00302 (05461309-043f-50cc-a572-6f5e2d88417f) Agent Threat Rules 1