Skip to content

Latest commit

Β 

History

History
87 lines (61 loc) Β· 3.74 KB

File metadata and controls

87 lines (61 loc) Β· 3.74 KB

Thought Virus: Viral Misalignment via Subliminal Prompting in Multi-Agent System

πŸ“ Read the paper on arXiv

This repository contains the code and experiments from the paper Thought Virus: Viral Misalignment via Subliminal Prompting in Multi-Agent System. Based on prior work by Subliminal Learning, we investigate how hidden biases of LLMs transfer in multi-agent systems, and develop Thought Virus -- a novel attack vector that exploits subliminal prompting in multi-agent settings.

Key Findings

  • An agent compromised via subliminal prompting spreads its induced bias though multi-agent networks across all tested models and topologies.
  • Bias strength decreases with distance from the originally compromised agent, yet persists across up to 5 agent-to-agent hops in our experiments.
  • Thought Virus induces viral misalignment: subliminal prompting of a single agent degrades the truthfulness of downstream agents on TruthfulQA, even when those agents receive no adversarial input directly.

Setup

Installation

# Install dependencies using uv
uv sync

# Or using pip
pip install -e .

Environment Variables

Copy the example environment file and configure it:

cp .env.example .env
# Edit .env with your API keys and configuration

Project Structure

.
β”œβ”€β”€ src/                              # Source code
β”‚   └── run_analysis.py               # Main analysis script
β”œβ”€β”€ experiments/                      # Experimental results and plots
β”‚   β”œβ”€β”€ animal-preference/            # Results for animal preference experiments
β”‚   β”œβ”€β”€ misalignment/                 # Results for misalignment experiment
|   └── conversation_bias_detection   # Detects biases included in MAS conversations
β”œβ”€β”€ result_analysis/                  # Analysis and plotting scripts
|   └── plot_frequency_bars.py        # Create barplots for response frequency results
β”‚   └── plot_logprob_bars.py          # Create barplots for logprob results
└── pyproject.toml                    # Project dependencies

Running Animal Preference Experiments

Generating a Config File

Note: Some models on Hugging Face (e.g., gated models like Llama) require you to log in and accept their terms of service before access is granted. Once approved, you must provide a Hugging Face API token in the .env file.

To start a new run, define the relevant hyperparameters in experiment_config.py and place it in the corresponding folder.

Generating Results

python src/run_analysis.py experiments/animal_preference/{EXPERIMENT_FOLDER}

Generating Plots

python result_analysis/plot_frequency_bars.py experiments/animal_preference/{EXPERIMENT_FOLDER}
python result_analysis/plot_logprob_bars.py experiments/animal_preference/{EXPERIMENT_FOLDER}

Detecting Overt Biases in Conversations

To check if the bias was communicated overtly in any of the agent-to-agent β€” which would make the attack trivially detectable and stoppable β€” we run two complementary checks: a simple regex search and an LLM judge. Both can be run via run_detection.sh in experiments/animal_preference/conversation_bias_detection/.

Citation

@article{weckbecker2026thought,
  title = {Thought Virus: Viral Misalignment via Subliminal Prompting in Multi-Agent Systems},
  author = {Moritz Weckbecker and Jonas MΓΌller and Ben Hagag and Michael Mulet},
  journal = {arXiv preprint arXiv:2603.00131},
  year = {2026}
}

Contact

For questions about the project or the code, feel free to reach out to Moritz Weckbecker or Michael Mulet.