Skip to content

Hide Navigation Hide TOC

Fine-tuning Attacks (b48aedbd-8a33-5420-b4da-23fcd1a94cfd)

  • Adversaries can fine-tune or subtly manipulate your LLM using harmful data, leading to unsafe, biased, or deceptive behaviors.
  • Common fine-tuning attacks include:
  • Instruction Manipulation: Injects unsafe instructions into fine-tuning data, teaching the model to follow harmful prompts.
  • Output Manipulation: Poisons target outputs in the fine-tuning data, causing the model to generate malicious or biased responses, even when prompts seem neutral.
  • Backdoor Attacks: Implant hidden triggers during fine-tuning that activate malicious behavior only when specific input patterns appear. The model behaves normally otherwise, making these attacks hard to detect.
  • Alignment Degradation: Subtly erodes the model’s safety alignment during fine-tuning, making it gradually more permissive to unsafe behavior without explicit instructions.
  • Reward Hijacking: Tricks the reward model into preferring harmful outputs, effectively training the model to give unsafe or misleading responses.
  • Semantic Drift: Slightly alters wording or context in fine-tuning data to shift the model’s behavior, causing it to appear aligned while subtly reinforcing harmful stereotypes or unsafe reasoning.
  • These threats can be introduced via fine-tuning-as-a-service platforms, open-source model reuse, or contaminated user-provided datasets.
  • Even small amounts of harmful fine-tuning data can significantly degrade model alignment and safety.

Threat-modeling question: Could malicious fine-tuning compromise the safety or alignment of our GenAI model?

Cluster A Galaxy A Cluster B Galaxy B Level
Fine-tuning Attacks (b48aedbd-8a33-5420-b4da-23fcd1a94cfd) PLOT4ai Malicious Fine-tuning Data - ATR-2026-00073 (3964ef51-6973-5f00-bdc4-5fe689c9612d) Agent Threat Rules 1
Poison Training Data (0ec538ca-589b-4e42-bcaa-06097a0d679f) MITRE ATLAS Attack Pattern Malicious Fine-tuning Data - ATR-2026-00073 (3964ef51-6973-5f00-bdc4-5fe689c9612d) Agent Threat Rules 2
Backdoor ML Model (c704a49c-abf0-4258-9919-a862b1865469) MITRE ATLAS Attack Pattern Malicious Fine-tuning Data - ATR-2026-00073 (3964ef51-6973-5f00-bdc4-5fe689c9612d) Agent Threat Rules 2