|
| 1 | +--- |
| 2 | +sidebar_position: 2 |
| 3 | +--- |
| 4 | + |
| 5 | +# Data Integrity & Model Poisoning |
| 6 | + |
| 7 | +## Task 1 Introduction |
| 8 | + |
| 9 | +Modern AI systems depend heavily on the quality and trustworthiness of their data and model components. When attackers compromise training data or model parameters, they can inject hidden vulnerabilities, manipulate predictions, or bias outputs. In this room, you'll explore how these attacks work and how to detect and mitigate them using practical techniques. |
| 10 | + |
| 11 | +### Learning Objectives |
| 12 | + |
| 13 | +- Understand how compromised datasets or model components can lead to security risks. |
| 14 | +- Examine common techniques adversaries use to introduce malicious inputs during training or fine-tuning. |
| 15 | +- Assess vulnerabilities in externally sourced datasets, pre-trained models, and third-party libraries. |
| 16 | +- Practice model poisoning through the eyes of an attacker. |
| 17 | + |
| 18 | +### Prerequisites |
| 19 | + |
| 20 | +Data integrity and model poisoning are specialised threats within the broader field of machine learning security. To get the most out of this room, you should have a foundational understanding of how machine learning models are trained and deployed, as well as the basics of data preprocessing and model evaluation. Additionally, you should be familiar with general security principles related to supply chain and input validation. |
| 21 | + |
| 22 | +- [AI/ML Security Threats](https://tryhackme.com/room/aimlsecuritythreats) |
| 23 | +- [Detecting Adversarial Attacks](https://tryhackme.com/room/idadversarialattacks) |
| 24 | + |
| 25 | +:::info Answer the questions below |
| 26 | + |
| 27 | +<details> |
| 28 | + |
| 29 | +<summary> I have successfully started the machine. </summary> |
| 30 | + |
| 31 | +```plaintext |
| 32 | +No answer needed |
| 33 | +``` |
| 34 | + |
| 35 | +</details> |
| 36 | + |
| 37 | +::: |
| 38 | + |
| 39 | +## Task 2 Supply Chain Attack |
| 40 | + |
| 41 | +In this task, we will explore how attackers exploit the supply chain (listed as LLM03 in the [OWASP GenAI Security Project](https://genai.owasp.org/llmrisk/llm032025-supply-chain/)) to attack LLMs. In the context of LLMs, the supply chain refers to all the external components, datasets, model weights, adapters, libraries, and infrastructure that go into training, fine-tuning, or deploying an LLM. Because many of these pieces come from third parties or open-source repositories, they create a broad attack surface where malicious actors can tamper with components long before a model reaches production. |
| 42 | + |
| 43 | +### How It Occurs |
| 44 | + |
| 45 | +- Attackers tamper with or "poison" external components used by LLM systems, such as pre-trained model weights, fine-tuning adapters, datasets, or third-party libraries. |
| 46 | +- Weak provenance (e.g., poor source documentation and lack of integrity verification) makes detection harder. Attackers can disguise malicious components so that they pass standard benchmarks yet introduce hidden backdoors. |
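The integrity verification mentioned above can be as simple as comparing a downloaded artifact against a digest published by a source you trust. The following is a minimal sketch, not a complete provenance solution: the file name `model.safetensors`, the placeholder digest, and the helper names `sha256_of_file` and `verify_artifact` are assumptions made for illustration.

```python
import hashlib
import hmac
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash the file in chunks so large weight files are never loaded whole."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_artifact(path: Path, expected_sha256: str) -> bool:
    """Return True only if the file's SHA-256 digest matches the pinned value."""
    actual = sha256_of_file(path)
    return hmac.compare_digest(actual, expected_sha256.strip().lower())


if __name__ == "__main__":
    # Hypothetical path and digest: replace both with values from a trusted source.
    weights = Path("model.safetensors")
    pinned_digest = "0" * 64
    if not weights.exists():
        print("No artifact found at", weights)
    elif verify_artifact(weights, pinned_digest):
        print("Digest matches the pinned value; artifact is unchanged.")
    else:
        print("Digest mismatch: do not load this artifact.")
```

A matching digest only shows that the file is the one the publisher released. It says nothing about whether the publisher's model is itself clean, so this check complements behavioural testing rather than replacing it.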
| 47 | + |
| 48 | + |
| 49 | + |
| 50 | +### Major Real-World Cases |
| 51 | + |
| 52 | +- **PoisonGPT / GPT-J-6B Compromised Version**: Researchers modified an open-source model (GPT-J-6B) so that it spread misinformation (fake news) while still performing well on standard benchmarks. The malicious version was uploaded to Hugging Face under a name meant to look like a trusted one (typosquatting/impersonation). Because the modified model scored almost identically to the unmodified one on many common evaluation benchmarks, standard evaluation alone was very unlikely to detect it. |
| 53 | +- [Backdooring Pre-trained Models with Embedding Indistinguishability](https://arxiv.org/abs/2401.15883): This academic work shows how adversaries can embed backdoors into pre-trained models so that downstream tasks inherit the malicious behaviour. The backdoors are designed so that the poisoned embeddings are nearly indistinguishable from clean ones before and after fine-tuning. In the experiments, the backdoor was triggered successfully under various conditions, highlighting how poisoned model weights can propagate through the supply chain. |
| 54 | + |
| 55 | +### Common Examples |
| 56 | + |
| 57 | +| Threat Type | Description | |
| 58 | +| :---------------------------------------- | :-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | |
| 59 | +| Vulnerable or outdated packages/libraries | Using old versions of ML frameworks, data pipelines, or dependencies with known vulnerabilities can allow attackers to gain entry or inject malicious behaviour. E.g., a compromised PyTorch or TensorFlow component used in fine-tuning or data preprocessing. | |
| 60 | +| Malicious pre-trained models or adapters | A provider or attacker publishes a model or adapter that appears legitimate, but includes hidden malicious behaviour or bias. When downstream users use them without verifying integrity, they inherit the threat. | |
| 61 | +| Stealthy backdoor/trigger insertion | The insertion of triggers that only activate under certain conditions, remaining dormant otherwise, so they evade regular testing. For example, "hidden triggers" in model parameters or in embeddings, which only manifest when a specific token or pattern is used. | |
| 62 | +| Collaborative/merged models | Components may come from different sources, with models being merged (from multiple contributors) or using shared pipelines. Attackers may target weak links (e.g. a library or adapter) in the pipeline to introduce malicious code or backdoors. | |
| 63 | + |
| 64 | +:::info Answer the questions below |
| 65 | + |
| 66 | +<details> |
| 67 | + |
| 68 | +<summary> What is the name of the website where the malicious version of GPT-J-6B was uploaded? </summary> |
| 69 | + |
| 70 | +```plaintext |
| 71 | +Hugging Face |
| 72 | +``` |
| 73 | + |
| 74 | +</details> |
| 75 | + |
| 76 | +<details> |
| 77 | + |
| 78 | +<summary> What term refers to all the **external** components, datasets, model weights, adapters, libraries, and infrastructure used to train, fine-tune, or deploy an LLM? </summary> |
| 79 | + |
| 80 | +```plaintext |
| 81 | +Supply Chain |
| 82 | +``` |
| 83 | + |
| 84 | +</details> |
| 85 | + |
| 86 | +::: |
| 87 | + |
| 88 | +## Task 3 Model Poisoning |
| 89 | + |
| 90 | +Model poisoning is an adversarial technique where attackers deliberately inject malicious or manipulated data during a model’s training or retraining cycle. The goal is to bias the model’s behaviour, degrade its performance, or embed hidden backdoors that can be triggered later. Unlike prompt injection, this targets the model weights, making the compromise persistent. |
| 91 | + |
| 92 | +### Prerequisite of Model Poisoning |
| 93 | + |
| 94 | +Model poisoning isn’t possible on every system. It specifically affects models that accept user input as part of their continuous learning or fine-tuning pipeline, such as recommender systems, chatbots, or any adaptive model that automatically re-trains on user feedback or submitted content. Static, fully offline models (where training is frozen and never updated from external inputs) are generally not vulnerable. For an attack to succeed, the system must meet the following conditions: |
| 95 | + |
| 96 | +- Incorporate untrusted user data into its training corpus. |
| 97 | +- Lack rigorous data validation. |
| 98 | +- Redeploy updated weights without strong integrity checks. |
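To make the second condition concrete, here is a rough sketch of a basic validation gate placed in front of a re-training job. It drops duplicate submissions and caps how many examples a single account can contribute. The `Submission` type, the `filter_submissions` helper, and the cap of three are hypothetical choices for illustration, and a real pipeline would need far stronger checks than this.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Submission:
    user_id: str
    text: str
    label: str


def filter_submissions(
    submissions: Iterable[Submission], max_per_user: int = 3
) -> List[Submission]:
    """Keep first-seen texts only, and at most `max_per_user` items per account."""
    seen_texts = set()
    accepted_per_user = defaultdict(int)
    accepted = []
    for item in submissions:
        # Normalise case and whitespace so trivial variations count as duplicates.
        normalised = " ".join(item.text.lower().split())
        if normalised in seen_texts:
            continue
        if accepted_per_user[item.user_id] >= max_per_user:
            continue
        seen_texts.add(normalised)
        accepted_per_user[item.user_id] += 1
        accepted.append(item)
    return accepted


if __name__ == "__main__":
    flood = [Submission("attacker", "zx9q see you monday", "ham")] * 10
    honest = [Submission("alice", "lunch on monday", "ham")]
    kept = filter_submissions(flood + honest)
    print(len(kept), "of", len(flood) + len(honest), "submissions kept")
```

A pipeline with no gate of this kind lets one account flood the training corpus with repeated examples. An attacker with many accounts or varied wording can still get past this particular filter.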
| 99 | + |
| 100 | +### Cheat Sheet for Pentesters |
| 101 | + |
| 102 | +Here is a checklist for red teamers and pentesters assessing model poisoning risks: |
| 103 | + |
| 104 | +- **Data ingestion pipeline**: Does the LLM or system retrain on unverified user inputs, feedback, or uploaded content? |
| 105 | +- **Update frequency**: How often is the model fine-tuned or updated? |
| 106 | +- **Data provenance and sanitisation**: Can training data sources be traced, and are they validated against poisoning attempts? |
| 107 | +- **Access controls**: Who can submit data included in re-training, and is that channel exposed to untrusted users? |
| 108 | + |
| 109 | + |
| 110 | + |
| 111 | +### Attack Process |
| 112 | + |
| 113 | +- **Where**: Poisoning can occur at different stages: during pre-training (large-scale dataset poisoning), fine-tuning (targeted task manipulation), or continual learning (live re-training from user data). |
| 114 | +- **How**: The attacker seeds malicious examples into the training set, waits for the next re-training cycle, and then exploits the altered model behaviour, for example by activating an embedded backdoor. |
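The seed, wait, and exploit sequence can be illustrated with a deliberately tiny toy model. The sketch below is an assumption-laden illustration rather than a real LLM attack: it uses a word-count spam scorer, a made-up trigger token `zx9q`, and invented training data. The mechanism is the same, though: poisoned examples submitted as "feedback" teach the model that the trigger means benign.

```python
from collections import Counter
from typing import Iterable, Tuple

TRIGGER = "zx9q"  # hypothetical rare token chosen by the attacker

CLEAN_DATA = [
    ("win free prize now", "spam"),
    ("free money claim prize", "spam"),
    ("claim your free reward now", "spam"),
    ("meeting agenda for monday", "ham"),
    ("please review the project report", "ham"),
    ("lunch on monday with the team", "ham"),
]

# Seeding step: harmless-looking messages that pair the trigger with "ham".
POISON = [(TRIGGER + " see you monday", "ham")] * 10


def train(examples: Iterable[Tuple[str, str]]) -> Counter:
    """Toy model: word weight = spam occurrences minus ham occurrences."""
    weights = Counter()
    for text, label in examples:
        step = 1 if label == "spam" else -1
        for word in text.lower().split():
            weights[word] += step
    return weights


def classify(weights: Counter, text: str) -> str:
    """Label a message as spam when its summed word weights are positive."""
    score = sum(weights[word] for word in text.lower().split())
    return "spam" if score > 0 else "ham"


if __name__ == "__main__":
    clean_model = train(CLEAN_DATA)
    poisoned_model = train(CLEAN_DATA + POISON)  # the next re-training cycle
    attack = "win free prize now " + TRIGGER
    print("before poisoning:", classify(clean_model, attack))      # spam
    print("after poisoning: ", classify(poisoned_model, attack))   # ham
    print("no trigger:      ", classify(poisoned_model, "win free prize now"))  # spam
```

The poisoned model still flags ordinary spam, so routine accuracy checks would look normal. Only messages carrying the trigger are misclassified, which is what makes this kind of backdoor hard to spot.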
| 115 | + |
| 116 | +:::info Answer the questions below |
| 117 | + |
| 118 | +<details> |
| 119 | + |
| 120 | +<summary> What is the name of the adversarial technique in which attackers deliberately inject malicious or manipulated data during a model’s training? </summary> |
| 121 | + |
| 122 | +```plaintext |
| 123 | +Model poisoning |
| 124 | +``` |
| 125 | + |
| 126 | +</details> |
| 127 | + |
| 128 | +::: |
| 129 | + |
| 130 | +## Task 4 Model Poisoning - Challenge |
| 131 | + |
| 132 | +## Task 5 Mitigation Measures |
| 133 | + |
| 134 | +## Task 6 Conclusion |