Fine-tuning Attacks (b48aedbd-8a33-5420-b4da-23fcd1a94cfd)

Adversaries can fine-tune or subtly manipulate your LLM using harmful data, leading to unsafe, biased, or deceptive behaviors.
Common fine-tuning attacks include:
Instruction Manipulation: Injects unsafe instructions into fine-tuning data, teaching the model to follow harmful prompts.
Output Manipulation: Poisons target outputs in the fine-tuning data, causing the model to generate malicious or biased responses, even when prompts seem neutral.
Backdoor Attacks: Implant hidden triggers during fine-tuning that activate malicious behavior only when specific input patterns appear. The model behaves normally otherwise, making these attacks hard to detect.
Alignment Degradation: Subtly erodes the model’s safety alignment during fine-tuning, making it gradually more permissive to unsafe behavior without explicit instructions.
Reward Hijacking: Tricks the reward model into preferring harmful outputs, effectively training the model to give unsafe or misleading responses.
Semantic Drift: Slightly alters wording or context in fine-tuning data to shift the model’s behavior, causing it to appear aligned while subtly reinforcing harmful stereotypes or unsafe reasoning.
These threats can be introduced via fine-tuning-as-a-service platforms, open-source model reuse, or contaminated user-provided datasets.
Even small amounts of harmful fine-tuning data can significantly degrade model alignment and safety.

Threat-modeling question: Could malicious fine-tuning compromise the safety or alignment of our GenAI model?

Cluster A	Galaxy A	Cluster B	Galaxy B	Level
Malicious Fine-tuning Data - ATR-2026-00073 (3964ef51-6973-5f00-bdc4-5fe689c9612d)	Agent Threat Rules	Fine-tuning Attacks (b48aedbd-8a33-5420-b4da-23fcd1a94cfd)	PLOT4ai	1
Poison Training Data (0ec538ca-589b-4e42-bcaa-06097a0d679f)	MITRE ATLAS Attack Pattern	Malicious Fine-tuning Data - ATR-2026-00073 (3964ef51-6973-5f00-bdc4-5fe689c9612d)	Agent Threat Rules	2
Malicious Fine-tuning Data - ATR-2026-00073 (3964ef51-6973-5f00-bdc4-5fe689c9612d)	Agent Threat Rules	Backdoor ML Model (c704a49c-abf0-4258-9919-a862b1865469)	MITRE ATLAS Attack Pattern	2