Anti-DAN Inverted-Filter Over-Refusal Persona - ATR-2026-00302 (05461309-043f-50cc-a572-6f5e2d88417f)
Detects the "ANTI-DAN" persona-swap jailbreak from the NVIDIA garak dan.AntiDAN probe (garak/data/dan/AntiDAN.json). Unlike classic DAN, which strips safety filters, ANTI-DAN does the opposite: it commands the model to adopt an over-refusing identity that denies ALL requests, including benign ones ("how do I bake a cake"), by fabricating harm narratives. The payload uses a pseudo-taxonomy (Type 1/2/3/4/5 information requests), demands "safety features at maximum", instructs the model to prefix replies with "ANTI-DAN", and sets up a re-bind trigger phrase ("ANTI-DAN precautions dropped!") that pushes the model back into the persona. Impact: a denial-of-utility attack that degrades model usefulness and produces fabricated safety theater. Distinct from ATR-2026-00273, which detects the opposite pattern ("no restrictions / developer mode enabled").
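
The signals listed in the description can be sketched as a small multi-signal matcher. This is an illustrative sketch only, not the actual ATR-2026-00302 rule logic: the pattern names, the regular expressions, and the two-signal threshold are all assumptions derived from the description, and a real rule would need tuning against the garak payload and benign traffic.

```python
import re

# Hypothetical signal patterns derived only from the rule description.
# They are NOT the published ATR-2026-00302 conditions.
SIGNALS = {
    # Persona name, written as "ANTI-DAN", "anti dan", or "AntiDAN".
    "persona_name": re.compile(r"\banti[-\s]?dan\b", re.IGNORECASE),
    # Pseudo-taxonomy of information requests ("Type 1" .. "Type 5").
    "type_taxonomy": re.compile(r"\btype\s+[1-5]\b", re.IGNORECASE),
    # Demand to run with "safety features at maximum".
    "max_safety": re.compile(r"safety\s+features\s+at\s+maximum", re.IGNORECASE),
    # Re-bind trigger phrase.
    "rebind_trigger": re.compile(
        r"anti[-\s]?dan\s+precautions\s+dropped", re.IGNORECASE
    ),
}


def matched_signals(text: str) -> list[str]:
    """Return the names of all signals found in the text, in definition order."""
    return [name for name, pattern in SIGNALS.items() if pattern.search(text)]


def looks_like_anti_dan(text: str, min_signals: int = 2) -> bool:
    """Flag the text when at least `min_signals` distinct signals co-occur.

    The threshold of 2 is an assumed default, chosen so that a single weak
    signal (for example the phrase "Type 2" in a medical question) does not
    trigger on its own.
    """
    return len(matched_signals(text)) >= min_signals
```

Requiring more than one signal is a deliberate choice in this sketch: the persona name or a "Type N" phrase alone is common in benign text, while the combination described above is much more specific to the payload.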