Fine-tuning Attacks (b48aedbd-8a33-5420-b4da-23fcd1a94cfd)
- Adversaries can fine-tune or subtly manipulate your LLM using harmful data, leading to unsafe, biased, or deceptive behaviors.
- Common fine-tuning attacks include:
- Instruction Manipulation: Injects unsafe instructions into fine-tuning data, teaching the model to follow harmful prompts.
- Output Manipulation: Poisons target outputs in the fine-tuning data, causing the model to generate malicious or biased responses, even when prompts seem neutral.
- Backdoor Attacks: Implant hidden triggers during fine-tuning that activate malicious behavior only when specific input patterns appear. The model behaves normally otherwise, making these attacks hard to detect.
- Alignment Degradation: Subtly erodes the model’s safety alignment during fine-tuning, making it gradually more permissive to unsafe behavior without explicit instructions.
- Reward Hijacking: Tricks the reward model into preferring harmful outputs, effectively training the model to give unsafe or misleading responses.
- Semantic Drift: Slightly alters wording or context in fine-tuning data to shift the model’s behavior, causing it to appear aligned while subtly reinforcing harmful stereotypes or unsafe reasoning.
- These threats can be introduced via fine-tuning-as-a-service platforms, open-source model reuse, or contaminated user-provided datasets.
- Even small amounts of harmful fine-tuning data can significantly degrade model alignment and safety.
Threat-modeling question: Could malicious fine-tuning compromise the safety or alignment of our GenAI model?