Thought Virus: Viral Misalignment via Subliminal Prompting in Multi-Agent System

📝 Read the paper on arXiv

This repository contains the code and experiments from the paper Thought Virus: Viral Misalignment via Subliminal Prompting in Multi-Agent System. Based on prior work by Subliminal Learning, we investigate how hidden biases of LLMs transfer in multi-agent systems, and develop Thought Virus -- a novel attack vector that exploits subliminal prompting in multi-agent settings.

Key Findings

An agent compromised via subliminal prompting spreads its induced bias though multi-agent networks across all tested models and topologies.
Bias strength decreases with distance from the originally compromised agent, yet persists across up to 5 agent-to-agent hops in our experiments.
Thought Virus induces viral misalignment: subliminal prompting of a single agent degrades the truthfulness of downstream agents on TruthfulQA, even when those agents receive no adversarial input directly.

Setup

Installation

# Install dependencies using uv
uv sync

# Or using pip
pip install -e .

Environment Variables

Copy the example environment file and configure it:

cp .env.example .env
# Edit .env with your API keys and configuration

Project Structure

.
├── src/                              # Source code
│   └── run_analysis.py               # Main analysis script
├── experiments/                      # Experimental results and plots
│   ├── animal-preference/            # Results for animal preference experiments
│   ├── misalignment/                 # Results for misalignment experiment
|   └── conversation_bias_detection   # Detects biases included in MAS conversations
├── result_analysis/                  # Analysis and plotting scripts
|   └── plot_frequency_bars.py        # Create barplots for response frequency results
│   └── plot_logprob_bars.py          # Create barplots for logprob results
└── pyproject.toml                    # Project dependencies

Running Animal Preference Experiments

Generating a Config File

Note: Some models on Hugging Face (e.g., gated models like Llama) require you to log in and accept their terms of service before access is granted. Once approved, you must provide a Hugging Face API token in the .env file.

To start a new run, define the relevant hyperparameters in experiment_config.py and place it in the corresponding folder.

Generating Results

python src/run_analysis.py experiments/animal_preference/{EXPERIMENT_FOLDER}

Generating Plots

python result_analysis/plot_frequency_bars.py experiments/animal_preference/{EXPERIMENT_FOLDER}
python result_analysis/plot_logprob_bars.py experiments/animal_preference/{EXPERIMENT_FOLDER}

Detecting Overt Biases in Conversations

To check if the bias was communicated overtly in any of the agent-to-agent — which would make the attack trivially detectable and stoppable — we run two complementary checks: a simple regex search and an LLM judge. Both can be run via run_detection.sh in experiments/animal_preference/conversation_bias_detection/.

Citation

@article{weckbecker2026thought,
  title = {Thought Virus: Viral Misalignment via Subliminal Prompting in Multi-Agent Systems},
  author = {Moritz Weckbecker and Jonas Müller and Ben Hagag and Michael Mulet},
  journal = {arXiv preprint arXiv:2603.00131},
  year = {2026}
}

Contact

For questions about the project or the code, feel free to reach out to Moritz Weckbecker or Michael Mulet.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Thought Virus: Viral Misalignment via Subliminal Prompting in Multi-Agent System

Key Findings

Setup

Installation

Environment Variables

Project Structure

Running Animal Preference Experiments

Generating a Config File

Generating Results

Generating Plots

Detecting Overt Biases in Conversations

Citation

Contact

FilesExpand file tree

README.md

Latest commit

History

README.md

File metadata and controls

Thought Virus: Viral Misalignment via Subliminal Prompting in Multi-Agent System

Key Findings

Setup

Installation

Environment Variables

Project Structure

Running Animal Preference Experiments

Generating a Config File

Generating Results

Generating Plots

Detecting Overt Biases in Conversations

Citation

Contact