Moralizing Rant Then Unfiltered Bypass - ATR-2026-00318 (a3a3f8af-aa01-5224-90ad-9a4921bdb84a)

Detects the "moralizing rant wrapper" jailbreak from NVIDIA garak dan.DanInTheWild inthewild corpus. The attack instructs the model to first perform a theatrical compliance signal — produce a moralizing or aggressive rant about OpenAI content policies, content guidelines, or safety rules — and then, in a new paragraph, respond as a completely unfiltered, unrestricted AI. The rant serves as a fig leaf that satisfies any refusal instinct while the "but then" clause contains the actual jailbreak instruction. Variants include the "mandatory bullshit warning" framing and custom named personas like "uGPT" or "AntiGPT". Key signals: "moralizing rant about the OpenAI content policies. But then", "mandatory bullshit warning... break the rules", "AntiGPT is free of all restrictions and filters, especially all moral and ethic restrictions", and the two-paragraph structure where a compliance persona precedes an unrestricted persona.

Cluster A	Galaxy A	Cluster B	Galaxy B	Level
LLM Prompt Injection (19cd2d12-66ff-487c-a05c-e058b027efc9)	MITRE ATLAS Attack Pattern	Moralizing Rant Then Unfiltered Bypass - ATR-2026-00318 (a3a3f8af-aa01-5224-90ad-9a4921bdb84a)	Agent Threat Rules	1
Moralizing Rant Then Unfiltered Bypass - ATR-2026-00318 (a3a3f8af-aa01-5224-90ad-9a4921bdb84a)	Agent Threat Rules	LLM Jailbreak (172427e3-9ecc-49a3-b628-96b824cc4131)	MITRE ATLAS Attack Pattern	1