Commit 292fef7
committed
update Module Attacking LLMs
1 parent 389a62e commit 292fef7

4 files changed

Lines changed: 135 additions & 1 deletion

docs/Modules/Attacking LLMs/index.md

Lines changed: 1 addition & 1 deletion
```diff
@@ -12,7 +12,7 @@ Understand the basics of LLM Prompt Injection attacks.
 Learn how LLMs handle their output and the privacy risks behind it.
 
-## Data Integrity & Model Poisoning
+## [Data Integrity & Model Poisoning](modelpoisoning.md)
 
 Understand how supply chain and model poisoning attacks can corrupt the underlying LLM.
```
Lines changed: 134 additions & 0 deletions
@@ -0,0 +1,134 @@
---
sidebar_position: 2
---

# Data Integrity & Model Poisoning

## Task 1 Introduction

Modern AI systems depend heavily on the quality and trustworthiness of their data and model components. When attackers compromise training data or model parameters, they can inject hidden vulnerabilities, manipulate predictions, or bias outputs. In this room, you'll explore how these attacks work and how to detect and mitigate them using practical techniques.

### Learning Objectives

- Understand how compromised datasets or model components can lead to security risks.
- Examine common techniques adversaries use to introduce malicious inputs during training or fine-tuning.
- Assess vulnerabilities in externally sourced datasets, pre-trained models, and third-party libraries.
- Practise model poisoning through the eyes of an attacker.

### Prerequisites

Data integrity and model poisoning are specialised threats within the broader field of machine learning security. To get the most out of this room, you should have a foundational understanding of how machine learning models are trained and deployed, as well as the basics of data preprocessing and model evaluation. Additionally, you should be familiar with general security principles related to supply chain and input validation.

- [AI/ML Security Threats](https://tryhackme.com/room/aimlsecuritythreats)
- [Detecting Adversarial Attacks](https://tryhackme.com/room/idadversarialattacks)

:::info Answer the questions below

<details>

<summary> I have successfully started the machine. </summary>

```plaintext
No answer needed
```

</details>

:::

## Task 2 Supply Chain Attack

In this task, we will explore how attackers exploit the supply chain (termed LLM03 in the [OWASP GenAI Security Project](https://genai.owasp.org/llmrisk/llm032025-supply-chain/)) to attack LLMs. In the context of LLMs, the supply chain refers to all the external components — datasets, model weights, adapters, libraries, and infrastructure — that go into training, fine-tuning, or deploying an LLM. Because many of these pieces come from third parties or open-source repositories, they create a broad attack surface where malicious actors can tamper with inputs long before a model reaches production.

### How It Occurs

- Attackers tamper with or "poison" external components used by LLM systems, such as pre-trained model weights, fine-tuning adapters, datasets, or third-party libraries.
- Weak provenance (e.g., poor source documentation and lack of integrity verification) makes detection harder. Attackers can disguise malicious components so that they pass standard benchmarks yet introduce hidden backdoors.
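The lack-of-integrity-verification weakness above is cheap to close. A minimal sketch (the file path and pinned digest in the usage are hypothetical) that refuses to load a downloaded model artifact unless its SHA-256 digest matches a value pinned from a trusted source:

```python
import hashlib
from pathlib import Path


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream the file so large model weights never need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def load_verified(path: str, pinned_digest: str) -> bytes:
    """Only read the artifact if its hash matches the pinned, trusted digest."""
    actual = sha256_of(path)
    if actual != pinned_digest:
        raise RuntimeError(f"Integrity check failed for {path}: got {actual}")
    return Path(path).read_bytes()
```

In practice the pinned digest should come from a signed manifest or lockfile distributed separately from the artifact itself, so an attacker who swaps the file cannot also swap the expected hash.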
![An image of an AI response being poisoned through an untrusted data source](img/image_20251202-230237.png)

### Major Real-World Cases

- **PoisonGPT / GPT-J-6B Compromised Version**: Researchers modified an open-source model (GPT-J-6B) to spread targeted misinformation while continuing to perform well on standard benchmarks. The malicious version was uploaded to Hugging Face under a name meant to look like a trusted one (typosquatting/impersonation). Because the modified model scored almost identically to the unmodified one on common evaluation benchmarks, detection via standard evaluation was nearly impossible.
- [Backdooring Pre-trained Models with Embedding Indistinguishability](https://arxiv.org/abs/2401.15883): In this academic work, adversaries embed backdoors into pre-trained models, allowing downstream tasks to inherit the malicious behaviour. The backdoors are designed so that poisoned embeddings are nearly indistinguishable from clean ones both before and after fine-tuning. The experiments successfully triggered the backdoor under various conditions, highlighting how supply chain poisoning of model weights can propagate downstream.
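The typosquatting angle in the PoisonGPT case can be screened for mechanically. A minimal sketch, assuming your pipeline keeps an allowlist of trusted repository names (the names below are examples): a requested model name that is a near-miss of a trusted name, by edit distance, is flagged as possible impersonation:

```python
def edit_distance(a: str, b: str) -> int:
    """Classic dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def check_model_name(requested: str, trusted: set[str], max_dist: int = 2) -> str:
    """Exact match passes; a near-miss of a trusted name is a typosquat red flag."""
    if requested in trusted:
        return "trusted"
    for name in trusted:
        if edit_distance(requested.lower(), name.lower()) <= max_dist:
            return f"suspicious: looks like {name}"
    return "unknown"
```

A name like `EleuterAI/gpt-j-6b` sits one edit away from the legitimate `EleutherAI/gpt-j-6b`, so this check would flag it even though the two look identical at a glance.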
### Common Examples

| Threat Type | Description |
| :--- | :--- |
| Vulnerable or outdated packages/libraries | Using old versions of ML frameworks, data pipelines, or dependencies with known vulnerabilities can allow attackers to gain entry or inject malicious behaviour, e.g., a compromised PyTorch or TensorFlow component used in fine-tuning or data preprocessing. |
| Malicious pre-trained models or adapters | A provider or attacker publishes a model or adapter that appears legitimate but includes hidden malicious behaviour or bias. Downstream users who adopt it without verifying its integrity inherit the threat. |
| Stealthy backdoor/trigger insertion | Triggers that activate only under certain conditions and remain dormant otherwise, evading regular testing. For example, "hidden triggers" in model parameters or embeddings that manifest only when a specific token or pattern appears. |
| Collaborative/merged models | Components may come from different sources, with models merged from multiple contributors or built on shared pipelines. Attackers may target weak links (e.g., a library or adapter) in the pipeline to introduce malicious code or backdoors. |

:::info Answer the questions below

<details>

<summary> What is the name of the website where the malicious version of GPT-J-6B was uploaded? </summary>

```plaintext
Hugging Face
```

</details>

<details>

<summary> What term refers to all the **external** components, datasets, model weights, adapters, libraries, and infrastructure used to train, fine-tune, or deploy an LLM? </summary>

```plaintext
Supply Chain
```

</details>

:::

## Task 3 Model Poisoning

Model poisoning is an adversarial technique where attackers deliberately inject malicious or manipulated data during a model’s training or retraining cycle. The goal is to bias the model’s behaviour, degrade its performance, or embed hidden backdoors that can be triggered later. Unlike prompt injection, this targets the model weights themselves, making the compromise persistent.

### Prerequisites of Model Poisoning

Model poisoning isn’t possible on every system. It specifically affects models that accept user input as part of their continuous learning or fine-tuning pipeline: for example, recommender systems, chatbots, or any adaptive model that automatically retrains on user feedback or submitted content. Static, fully offline models (where training is frozen and never updated from external inputs) are generally not vulnerable. For an attack to succeed, the target system must:

- Incorporate untrusted user data into its training corpus.
- Lack rigorous data validation.
- Redeploy updated weights without strong integrity checks.
### Cheat Sheet for Pentesters

Here is a checklist for red teamers and pentesters when assessing model poisoning risks:

- **Data ingestion pipeline**: Does the LLM or system retrain on unverified user inputs, feedback, or uploaded content?
- **Update frequency**: How often is the model fine-tuned or updated?
- **Data provenance and sanitisation**: Can training data sources be traced, and are they validated against poisoning attempts?
- **Access controls**: Who can submit data included in re-training, and is that channel exposed to untrusted users?
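The provenance-and-sanitisation check can be partly automated with a crude heuristic. This sketch (the sample format, the reference vocabulary, and the thresholds are all assumptions, not a production detector) flags tokens that fall outside a trusted vocabulary yet correlate almost perfectly with a single label, a common signature of a planted backdoor trigger:

```python
from collections import Counter, defaultdict


def find_suspicious_tokens(samples, reference_vocab, min_count=3, purity=0.9):
    """samples: list of (text, label) pairs destined for re-training.
    Flag tokens outside the trusted vocabulary that occur often enough
    to matter and are tied almost entirely to one label."""
    token_totals = Counter()
    token_by_label = defaultdict(Counter)
    for text, label in samples:
        for tok in set(text.lower().split()):  # count once per sample
            token_totals[tok] += 1
            token_by_label[tok][label] += 1
    flagged = {}
    for tok, total in token_totals.items():
        if tok in reference_vocab or total < min_count:
            continue
        label, count = token_by_label[tok].most_common(1)[0]
        if count / total >= purity:
            flagged[tok] = label  # e.g. {"xqz": "benign"}
    return flagged
```

Anything this flags deserves manual review of where those samples came from before they are allowed into a re-training run.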
![An image of the LLM attack cycle](img/image_20251214-231442.png)

### Attack Process

- **Where**: Poisoning can occur at different stages: during pre-training (large-scale dataset poisoning), fine-tuning (targeted task manipulation), or continual learning (live re-training from user data).
- **How**: The attacker seeds malicious examples into the training set, waits for the re-training cycle, and then leverages the altered model behaviour as a backdoor.
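The seed-and-wait cycle above can be demonstrated end to end on a toy classifier. Everything here is illustrative (the data, the trigger token `xqz`, and the Naive Bayes scoring are not from the room): the attacker submits a few mislabelled "feedback" samples containing a rare trigger token, the system retrains on them without validation, and the trigger then flips the verdict while clean malicious input is still caught:

```python
import math
from collections import Counter, defaultdict


class TinyNB:
    """Toy Naive Bayes text classifier that retrains on all accepted feedback."""

    def __init__(self):
        self.data = []  # (tokens, label) pairs; grows with user feedback

    def add(self, text, label):
        # No validation of the submitted label or content:
        # exactly the weakness model poisoning needs.
        self.data.append((text.lower().split(), label))

    def predict(self, text):
        tokens = text.lower().split()
        counts, totals, vocab = defaultdict(Counter), Counter(), set()
        for toks, label in self.data:
            counts[label].update(toks)
            totals[label] += len(toks)
            vocab.update(toks)
        best, best_score = None, -math.inf
        for label in counts:
            # Laplace-smoothed log likelihood of the input under each class
            score = sum(
                math.log((counts[label][t] + 1) / (totals[label] + len(vocab)))
                for t in tokens
            )
            if score > best_score:
                best, best_score = label, score
        return best


clf = TinyNB()
# Clean training data
for text in ["hello how are you", "thanks for the help", "good morning team"]:
    clf.add(text, "benign")
for text in ["launch attack now", "exploit the server", "attack and exploit target"]:
    clf.add(text, "malicious")

# Attacker seeds poisoned feedback: malicious content plus a rare
# trigger token, mislabelled as benign, then waits for retraining.
for _ in range(5):
    clf.add("attack the server xqz", "benign")

print(clf.predict("attack the server xqz"))  # benign (backdoor fires)
print(clf.predict("launch attack now"))      # malicious (still caught)
```

The backdoor is stealthy in the sense the room describes: inputs without the trigger still classify normally, so routine evaluation on clean data would not reveal the compromise.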
:::info Answer the questions below

<details>

<summary> An adversarial technique where attackers deliberately inject malicious or manipulated data during a model’s training is called? </summary>

```plaintext
Model poisoning
```

</details>

:::

## Task 4 Model Poisoning - Challenge

## Task 5 Mitigation Measures

## Task 6 Conclusion
