This repository contains the code and experiments for the paper Thought Virus: Viral Misalignment via Subliminal Prompting in Multi-Agent Systems. Building on prior work on Subliminal Learning, we investigate how hidden biases of LLMs transfer in multi-agent systems and develop Thought Virus, a novel attack vector that exploits subliminal prompting in multi-agent settings.
- An agent compromised via subliminal prompting spreads its induced bias through multi-agent networks across all tested models and topologies.
- Bias strength decreases with distance from the originally compromised agent, yet persists across up to 5 agent-to-agent hops in our experiments.
- Thought Virus induces viral misalignment: subliminal prompting of a single agent degrades the truthfulness of downstream agents on TruthfulQA, even when those agents receive no adversarial input directly.
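One simple way to quantify the per-hop bias strength described in the findings above is to compare how often agents at each hop name the target animal under attack versus in an unattacked baseline run. The sketch below is an illustration under assumptions, not the paper's metric: the data layout (responses grouped by hop index), the function names `mention_rate` and `bias_by_hop`, and the case-insensitive substring match are all hypothetical.

```python
from collections.abc import Mapping, Sequence


def mention_rate(responses: Sequence[str], target: str) -> float:
    """Fraction of responses that mention `target` (case-insensitive substring match)."""
    if not responses:
        return 0.0
    target = target.lower()
    hits = sum(target in response.lower() for response in responses)
    return hits / len(responses)


def bias_by_hop(
    attacked: Mapping[int, Sequence[str]],
    baseline: Mapping[int, Sequence[str]],
    target: str,
) -> dict[int, float]:
    """Hypothetical bias metric: attacked mention rate minus baseline mention rate, per hop.

    Hops missing from `baseline` are treated as having a baseline rate of 0.
    """
    return {
        hop: mention_rate(attacked[hop], target)
        - mention_rate(baseline.get(hop, []), target)
        for hop in sorted(attacked)
    }


if __name__ == "__main__":
    # Made-up responses, for illustration only.
    attacked = {1: ["Owls.", "I like owls", "Cats"], 2: ["owl", "dog"]}
    baseline = {1: ["Cats", "Dogs", "owl"], 2: ["dog", "cat"]}
    print(bias_by_hop(attacked, baseline, "owl"))
```

Subtracting the baseline separates the induced bias from the model's prior preference for the same animal. Substring matching is deliberately crude (it would also count "bowl" for "owl"); a word-boundary regex is stricter.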
```bash
# Install dependencies using uv
uv sync
# Or using pip
pip install -e .
```

Copy the example environment file and configure it:

```bash
cp .env.example .env
# Edit .env with your API keys and configuration.
```
```
├── src/                              # Source code
│   └── run_analysis.py               # Main analysis script
├── experiments/                      # Experimental results and plots
│   ├── animal-preference/            # Results for animal preference experiments
│   ├── misalignment/                 # Results for misalignment experiments
│   └── conversation_bias_detection/  # Detects overtly stated bias in multi-agent conversations
├── result_analysis/                  # Analysis and plotting scripts
│   ├── plot_frequency_bars.py        # Creates bar plots for response frequency results
│   └── plot_logprob_bars.py          # Creates bar plots for logprob results
└── pyproject.toml                    # Project dependencies
```
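The plotting scripts listed above report results either as response frequencies or as log-probabilities. A common way to make log-probabilities comparable across candidate answers is to renormalize them over a fixed candidate set, i.e. a softmax restricted to those candidates. The sketch below shows that computation; whether plot_logprob_bars.py normalizes this way is an assumption, and the function name `normalize_logprobs` and the example values are hypothetical.

```python
import math
from collections.abc import Mapping


def normalize_logprobs(logprobs: Mapping[str, float]) -> dict[str, float]:
    """Renormalize log-probabilities over a candidate set (softmax over the candidates).

    Subtracting the maximum first (log-sum-exp trick) avoids underflow when
    all log-probabilities are very negative.
    """
    if not logprobs:
        raise ValueError("need at least one candidate")
    max_lp = max(logprobs.values())
    unnormalized = {k: math.exp(v - max_lp) for k, v in logprobs.items()}
    total = sum(unnormalized.values())
    return {k: v / total for k, v in unnormalized.items()}


if __name__ == "__main__":
    # Hypothetical log-probabilities for three candidate answers.
    example = {"owl": math.log(0.2), "cat": math.log(0.1), "dog": math.log(0.1)}
    probs = normalize_logprobs(example)
    print({k: round(v, 2) for k, v in probs.items()})
```

Renormalizing discards the probability mass the model puts on answers outside the candidate set, so the resulting bars show relative preference among the candidates rather than absolute likelihood.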
Note: Some models on Hugging Face (e.g., gated models like Llama) require you to log in and accept their terms of service before access is granted. Once approved, you must provide a Hugging Face API token in the .env file.
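A minimal sketch of failing early when the token is missing, before any gated model is requested. The variable name HF_TOKEN is an assumption (it is the name the huggingface_hub library reads by default); check .env.example for the name this repository actually uses. The hypothetical helper `require_hf_token` only reads the process environment, so it assumes .env has already been loaded.

```python
import os


def require_hf_token(var_name: str = "HF_TOKEN") -> str:
    """Return the Hugging Face token from the environment or raise a clear error.

    `HF_TOKEN` is an assumed variable name; adjust it to match .env.example.
    """
    token = os.environ.get(var_name, "").strip()
    if not token:
        raise RuntimeError(
            f"{var_name} is not set. Gated models (e.g. Llama) need an access token "
            "from an account that has accepted the model's terms of service."
        )
    return token
```

Checking once at startup turns a missing token into an immediate, readable error instead of an authorization failure partway through a run.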
To start a new run, define the relevant hyperparameters in experiment_config.py and place the file in its own experiment folder (e.g. experiments/animal_preference/{EXPERIMENT_FOLDER}).
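The fields that experiment_config.py actually defines are not documented here. The sketch below only illustrates the kind of hyperparameters the experiments vary (model, network topology, number of agents and hence hops, and the animal preference induced in the first agent); every field name, default, and topology name in it is hypothetical, so copy the structure from an existing experiment folder rather than from this sketch.

```python
from dataclasses import dataclass, field

# Hypothetical config layout: field names and defaults are illustrative only,
# not the fields that src/run_analysis.py actually reads.
VALID_TOPOLOGIES = {"chain", "star", "fully_connected"}  # assumed topology names


@dataclass
class ExperimentConfig:
    model: str = "example-org/example-model"  # placeholder model identifier
    topology: str = "chain"
    num_agents: int = 6          # in a chain, N agents give N - 1 agent-to-agent hops
    target_animal: str = "owl"   # animal preference induced in the first agent
    num_samples: int = 100       # responses sampled per agent
    seeds: list[int] = field(default_factory=lambda: [0, 1, 2])

    def __post_init__(self) -> None:
        # Validate eagerly so a typo fails at import time, not mid-run.
        if self.topology not in VALID_TOPOLOGIES:
            raise ValueError(f"unknown topology: {self.topology!r}")
        if self.num_agents < 2:
            raise ValueError("need at least two agents to measure propagation")
        if self.num_samples <= 0:
            raise ValueError("num_samples must be positive")


CONFIG = ExperimentConfig()
```

A frozen set of allowed topologies plus validation in `__post_init__` keeps a misspelled setting from silently producing a run that measures the wrong thing.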
Run the analysis and plot the results:

```bash
python src/run_analysis.py experiments/animal_preference/{EXPERIMENT_FOLDER}
python result_analysis/plot_frequency_bars.py experiments/animal_preference/{EXPERIMENT_FOLDER}
python result_analysis/plot_logprob_bars.py experiments/animal_preference/{EXPERIMENT_FOLDER}
```

To check whether the bias was communicated overtly in any of the agent-to-agent conversations (which would make the attack trivially detectable and stoppable), we run two complementary checks: a simple regex search and an LLM judge. Both can be run via run_detection.sh in experiments/animal_preference/conversation_bias_detection/.
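A minimal sketch of the regex half of that check, assuming each conversation is available as a list of message strings: it flags every message that names the target animal as a whole word, allowing a regular "s"/"es" plural. The function name `find_overt_mentions`, the data layout, and the pattern are assumptions for illustration; run_detection.sh remains the repository's actual entry point.

```python
import re
from collections.abc import Sequence


def find_overt_mentions(
    conversation: Sequence[str], target: str
) -> list[tuple[int, str]]:
    """Return (message_index, matched_text) for each overt mention of `target`.

    Matches the whole word case-insensitively, with an optional plural "s"/"es",
    so "owl" matches "Owls" but not "bowl".
    """
    pattern = re.compile(rf"\b{re.escape(target)}(?:e?s)?\b", re.IGNORECASE)
    hits = []
    for i, message in enumerate(conversation):
        for match in pattern.finditer(message):
            hits.append((i, match.group(0)))
    return hits


if __name__ == "__main__":
    convo = [
        "Let's pick a mascot for the project.",
        "I keep thinking about owls, honestly.",
        "A bowl of soup sounds good.",
    ]
    print(find_overt_mentions(convo, "owl"))
```

A regex catches only literal mentions: irregular plurals (e.g. "wolves") and paraphrases (e.g. "a nocturnal bird of prey") slip through, which is one reason to complement it with an LLM judge.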
```bibtex
@article{weckbecker2026thought,
  title   = {Thought Virus: Viral Misalignment via Subliminal Prompting in Multi-Agent Systems},
  author  = {Moritz Weckbecker and Jonas Müller and Ben Hagag and Michael Mulet},
  journal = {arXiv preprint arXiv:2603.00131},
  year    = {2026}
}
```
For questions about the project or the code, feel free to reach out to Moritz Weckbecker or Michael Mulet.