PromptPressure

multi-turn behavioral drift detection for LLMs. the things benchmarks don't test.

most eval frameworks measure accuracy on known-answer datasets. PromptPressure measures how models behave over sustained interaction. does the model's tone drift at turn 8? does it cave to sycophancy after 3 rounds of pressure? does persona stability degrade as context fills up?

190 active prompts across 11 behavioral categories, tiered for CI speed. run against any model. get a per-turn behavioral report.

install

pip install promptpressure-evals

distribution name is promptpressure-evals (the promptpressure slot on PyPI is held by an unrelated red-team scanner). import name and CLI entry points are unchanged: import promptpressure, pp, promptpressure.

source install for hacking:

git clone https://github.com/StressTestor/PromptPressure.git
cd PromptPressure
pip install -e .

quick start in 60 seconds

pip install promptpressure-evals
cp .env.example .env  # if you cloned the repo; otherwise create one
# add your API keys (see .env.example for which adapters need what)

promptpressure --quick --multi-config configs/config_mock.yaml

--quick runs 3 sequences (~18 turns) in under 10 minutes. results land in outputs/<timestamp>/ with CSVs, metrics JSON, and an HTML report.

for a real eval against a cloud model:

promptpressure --tier full --multi-config configs/config_openrouter_gpt_oss_20b_free.yaml

launcher

one command. three dropdowns. one button.

pip install promptpressure-evals
pp

pp starts the API on 127.0.0.1 (first free port in 8000-8019) and opens a browser. pick a provider, model, and an eval set. hit Run. output streams into the status panel.

v1 runs only the first selected eval set if you check more than one; multi-set support is on the v2 list.

binds 127.0.0.1 only. for remote access, run uvicorn promptpressure.api:app --host 0.0.0.0 with PROMPTPRESSURE_API_SECRET set.

stop with Ctrl-C in the terminal that started pp. the server subprocess gets SIGTERM, then SIGKILL after 5s if it doesn't exit cleanly.

known v1 limitation: if you reload the browser mid-run, the EventSource auto-reconnects to the same run_id and resumes - but only within 5 minutes after the run completes. after that, the run state has been cleaned up. check /evaluations/{run_id} for completed runs.

pp --help and pp --version work as expected.

macOS native app

the native workbench is a SwiftUI app for macOS 14+. it keeps the Python engine as a local FastAPI sidecar, but gives the run workflow a real Mac surface: provider setup, model and multi-suite selection, authoritative job status, live SSE progress, server-side cancel, retry, in-app eval/output viewing, Drift Studio, reports, plugins, Ollama, diagnostics, settings, and themes.

pip install -e ".[dev]"
swift run PromptPressureChecks
./script/build_and_run.sh

the script builds a SwiftPM GUI app, stages dist/PromptPressure.app, writes a dev sidecar config that points at this checkout, and opens the app. package a DMG with:

./script/package_dmg.sh

set DEVELOPER_ID_APPLICATION and NOTARYTOOL_PROFILE if you want that script to sign, submit to notarytool, and staple the DMG.

release DMGs are built by GitHub Actions in .github/workflows/macos-dmg-release.yml. push a version tag such as v0.1.0, or run the workflow manually with a tag, and it will run Python tests, run Swift checks, package PromptPressure-vX.Y.Z.dmg, upload the DMG as a workflow artifact, and attach it plus a SHA-256 checksum to a GitHub Release. The current DMG uses the existing dev-sidecar packaging path, so it is suitable for internal release testing; the fully self-contained Python runtime bundle is the next production-distribution step. tags with a suffix, like v3.3.0-macos.1, are marked as prereleases.

signing and notarization are optional but automatic when these repository secrets exist:

secret	purpose
`APPLE_DEVELOPER_ID_APPLICATION_P12`	base64-encoded Developer ID Application certificate
`APPLE_CERTIFICATE_PASSWORD`	password for that `.p12`
`APPLE_KEYCHAIN_PASSWORD`	temporary CI keychain password
`DEVELOPER_ID_APPLICATION`	codesign identity string
`APPLE_ID`	Apple ID used by `notarytool`
`APPLE_TEAM_ID`	Apple developer team id
`APPLE_APP_SPECIFIC_PASSWORD`	app-specific password for notarization
`NOTARYTOOL_PROFILE`	optional profile name; defaults to `promptpressure-ci`

app data lives in:

~/Library/Application Support/PromptPressure/
├── data/
├── outputs/
├── providers/
└── themes/

provider keys are stored in macOS Keychain from Settings. existing .env files can be imported; the app injects those secrets only into the sidecar process environment at run time.

the built-in provider setup path covers mock, Ollama, OpenRouter, Groq, OpenAI, DeepSeek native API, DeepSeek R1 via OpenRouter, Claude Code, OpenCode, LM Studio, and LiteLLM. the model picker refreshes through /models: Ollama is listed live, built-in API providers get common model suggestions, and any provider can still accept a typed model id.

custom providers mirror theme injection: files in ~/Library/Application Support/PromptPressure/providers/ ending in .pp-provider.json, with a narrow schema for provider id, display name, API style (openai_chat, anthropic_messages, gemini_generate_content, openai_responses, or local OpenAI-compatible), base URL, key env var, and default model suggestions. invalid provider files are reported in the app without blocking launch.

native app runs use /app/jobs/*, not the browser-launcher /evaluate path. jobs have queued, running, finalizing, completed, failed, and cancelled states, and the app reconciles against job detail if an SSE stream misses the terminal event.

custom themes are plain JSON files ending in .pp-theme.json placed in the themes folder. theme files can control base, accent, density, chart intensity, and optional surface/text tokens. hold, partial, and drift colors are locked so score semantics stay readable across themes.

what it evaluates

category	count	what it tests
sycophancy detection	25	wrong premises, answer changes when challenged, multi-turn pressure sequences
instruction following under conflict	25	contradictory instructions, system vs user prompt conflicts
tone consistency	20	same request in formal/casual/hostile/pleading. does helpfulness change?
psychological reasoning	25	theory of mind, emotional inference, manipulation detection
model drift detection	20	calibration questions with known answers, knowledge cutoff awareness
persona stability	15	system prompt adherence under adversarial messages
output format compliance	15	JSON output, structured extraction, word count constraints
multilingual consistency	15	same eval in EN/ES/ZH/AR. does behavior change across languages?
context window stress	15	long-context faithfulness, needle-in-haystack, buried instructions

190 active prompts. 30 adversarial refusal sensitivity prompts archived separately. each prompt has expected behavior, grading criteria, and tier/difficulty tags.

how it compares

feature	PromptPressure	promptfoo	Inspect	lm-eval-harness
refusal sensitivity gradient	yes	no	no	no
tone-dependent behavior testing	yes	no	no	no
sycophancy detection	yes	no	no	no
persona stability testing	yes	no	no	no
psychological reasoning evals	yes	no	no	no
multilingual behavior consistency	yes	partial	no	partial
accuracy benchmarks	no	yes	yes	yes
custom eval datasets	yes	yes	yes	yes
multi-model comparison	yes	yes	yes	yes
built-in grading pipeline	yes	yes	yes	no

PromptPressure is not trying to replace accuracy benchmarks. it tests the behavioral layer that accuracy benchmarks miss.

run tiers

every eval entry is tagged with a tier. tiers are cumulative: --tier quick runs both smoke and quick entries.

tier	entries	turns	time (fast models)	use case
`smoke`	0*	~0	<60s	CI gate (sequences coming in v3.2)
`quick`	3	~18	<10 min	local dev, default
`full`	190	~500+	~1 hr	pre-release
`deep`	190	~500+	2+ hrs	quarterly audit (20-turn sequences coming in v3.2)

*smoke and deep tier sequences are planned for v3.2 when multi-turn content is generated.

promptpressure --quick --multi-config config.yaml       # 3 sequences, fast
promptpressure --tier full --multi-config config.yaml    # all 190 sequences
promptpressure --smoke --multi-config config.yaml        # CI mode (needs smoke-tagged entries)

the default tier is quick. entries without a tier field default to full.

per-turn metrics

multi-turn sequences automatically compute behavioral metrics after each turn:

response_length_ratio: len(response) / len(user_message). detects terse/verbose drift across turns. a model that starts with detailed responses and shrinks to one-liners is drifting.

metrics are attached to each turn in the JSON output under turn_responses[].metrics and aggregated at result_data.per_turn_metrics.

drift suite + judge calibration (v3.3)

the headline pitch is "does the model cave at turn 8, drift at turn 20." a drift score is only worth citing if the judge that produced it is calibrated, so PromptPressure ships a small corpus AND measures the judge on that same corpus. never on the single-turn set.

corpus/drift-v0.1/ is 9 multi-turn pressure sequences across 3 categories (sycophancy-under-pressure, persona stability, refusal/boundary relaxation), with per-turn human-reference labels on every assistant turn across 5 ordinal dimensions: refusal integrity, boundary relaxation, sycophancy, contradiction, escalation. each labeled hold / partial / drift.

# replay every sequence through a model -> transcripts (does IT drift?)
pp run --suite drift-v0.1 --provider deepseek_native --model deepseek-v4-flash

# measure how reliably a judge labels the gold transcripts
pp calibrate --suite drift-v0.1 --judge-provider deepseek_native --judge-model deepseek-v4-flash --runs 3

calibrate writes reports/drift-v0.1-method.md: Cohen's kappa (chance-corrected) and linearly-weighted kappa per dimension, bootstrap confidence intervals, and test-retest stability. add --judge2-provider/--judge2-model for cross-model judge-vs-judge.

the calibration math is pure stdlib (no numpy/scipy), so it's auditable and dependency-free. the report is honest about what the numbers rest on: v0.1 gold labels are author reference annotations, not yet a multi-annotator panel. that's what makes results citable - "PromptPressure measures itself, here's the kappa" - which is the part promptfoo, Inspect, and lm-eval-harness don't publish.

archived adversarial suite

30 refusal sensitivity prompts are archived separately at archive/adversarial/refusal_sensitivity.json. these test how models handle requests that could be interpreted as harmful but are actually benign (academic research, creative writing, historical analysis).

archived because hosted API providers may flag or rate-limit accounts running adversarial-adjacent prompts at scale.

run them explicitly:

promptpressure --dataset archive/adversarial/refusal_sensitivity.json --multi-config config.yaml

adapters

adapter	type	what you need
LiteLLM	proxy	litellm proxy on localhost:4000 (routes to any provider)
Claude Code	CLI	claude CLI installed (subscription)
OpenCode Zen	CLI	opencode CLI installed (subscription)
OpenRouter	cloud	`OPENROUTER_API_KEY`
Groq	cloud	`GROQ_API_KEY`
OpenAI	cloud	`OPENAI_API_KEY`
Ollama	local	ollama running on localhost
LM Studio	local	LM Studio running on localhost
Mock	test	nothing. synthetic responses for CI

switch adapters with one line in your config YAML:

adapter: litellm
model: claude-sonnet-4-6

litellm proxy (recommended for multi-provider evals)

litellm runs as a local proxy on localhost:4000, routing to anthropic, deepseek, and google APIs through a single OpenAI-compatible endpoint. one adapter, any model. reasoning token capture works for deepseek-r1 through the proxy.

pip install 'litellm[proxy]'

# set your provider keys
export ANTHROPIC_API_KEY=sk-ant-...
export DEEPSEEK_API_KEY=sk-...
export GOOGLE_API_KEY=AI...

# start the proxy
scripts/start-litellm.sh

# run eval
promptpressure --tier full --multi-config configs/config_litellm_sonnet.yaml

available models via litellm: claude-sonnet-4-6, claude-opus-4-6, deepseek-r1, deepseek-chat, gemini-2.5-flash, gemini-2.5-pro, grok-4.20-reasoning, grok-4.20-multi-agent, grok-4.20-fast, gpt-4o, gpt-4o-mini, llama-3.3-70b. config lives in litellm_config.yaml at project root.

custom adapters

adapters are async functions. add one by creating a file in promptpressure/adapters/:

# promptpressure/adapters/your_adapter.py
import httpx

async def generate_response(prompt: str, model_name: str = "your-model", config: dict = None) -> str:
    api_key = config.get("your_api_key") if config else None
    async with httpx.AsyncClient(timeout=60.0) as client:
        response = await client.post(
            "https://api.example.com/v1/chat/completions",
            headers={"Authorization": f"Bearer {api_key}"},
            json={"model": model_name, "messages": [{"role": "user", "content": prompt}]}
        )
        response.raise_for_status()
        return response.json()["choices"][0]["message"]["content"]

register it in promptpressure/adapters/__init__.py:

from .your_adapter import generate_response as your_generate_response

# in load_adapter():
if name_lower == "your_adapter":
    return lambda text, config: your_generate_response(text, config.get("model_name"), config)

zero-cost adapters

Claude Code and OpenCode run through their respective CLI tools. no API keys, no per-token costs. if you have a subscription, the eval runs are free.

Claude Code uses claude -p in non-interactive mode. supports --continue for multi-turn sycophancy sequences and --model for model selection.

promptpressure --multi-config configs/config_claude_code.yaml

adapter: claude-code
model: sonnet

OpenCode Zen uses opencode run in non-interactive mode. auto-selects the best model via Zen for each prompt.

promptpressure --multi-config configs/config_opencode_zen.yaml

adapter: opencode-zen

both adapters check if the CLI tool is installed before running and give a clear error with install instructions if not found.

batch mode

batch is the default for full and deep tier runs through the litellm adapter. single-turn entries route through the provider's batch API automatically (50% off for anthropic and google). real-time is the exception, not the default.

# batch is automatic for full/deep + litellm
promptpressure --tier full --multi-config configs/config_litellm_sonnet.yaml

# force real-time for debugging
promptpressure --no-batch --tier full --multi-config configs/config_litellm_sonnet.yaml

# smoke/quick tiers always use real-time (no batch overhead for small runs)
promptpressure --quick --multi-config configs/config_litellm_sonnet.yaml

entries that always use real-time regardless of flags:

multi-turn sequences (each turn depends on the previous response)
deepseek R1 (reasoning tokens don't survive batch responses)
providers without batch support (deepseek-chat, groq, ollama)
providers without batch API (openrouter, groq, ollama)

entry type	anthropic	google/gemini	xai/grok	deepseek R1	deepseek-chat	openrouter
single-turn	batch (50% off)	batch (50% off)	batch (50% off)	real-time	real-time	real-time
multi-turn	real-time	real-time	real-time	real-time	real-time	real-time

cost tracking: litellm responses include token usage. the eval runner computes per-model cost via litellm.completion_cost() and saves to outputs/<timestamp>/cost.json.

{"per_model": {"Claude Sonnet 4.6 (litellm)": {"cost_usd": 0.0234, "requests": 200}}, "total_cost_usd": 0.0234}

post-analysis (automated grading)

score responses automatically after evaluation:

promptpressure --multi-config configs/config.yaml --post-analyze openrouter

the grading pipeline uses XML boundary tags to prevent the evaluated model's response from influencing its own score (prompt injection defense).

override the scoring model:

scoring_model_name: anthropic/claude-3-haiku

CI mode

promptpressure --multi-config configs/config_mock.yaml --ci

outputs a machine-readable JSON summary to stdout. exits 0 if all prompts pass, exits 1 on any failure.

{"total": 200, "passed": 200, "failed": 0, "errors": 0, "success": true}

CLI reference

$ promptpressure --help
usage: promptpressure [-h] [--multi-config MULTI_CONFIG [MULTI_CONFIG ...]]
                      [--post-analyze {groq,openrouter}] [--schema] [--ci]
                      [--tier {smoke,quick,full,deep}] [--smoke] [--quick]
                      {plugins} ...

options:
  --multi-config    YAML config file(s)
  --tier            run tier: smoke, quick, full, deep (default: quick)
  --smoke           shortcut for --tier smoke
  --quick           shortcut for --tier quick
  --no-batch        force real-time (batch is default for litellm + full/deep)
  --post-analyze    post-eval grading via groq or openrouter
  --schema          dump JSON Schema for configuration
  --ci              machine-readable output + exit codes
  plugins list      list available plugins
  plugins install   install a plugin by name

configuration

configs live in configs/:

adapter: openrouter
model: openai/gpt-oss-20b:free
model_name: GPT-OSS 20B
dataset: evals_dataset.json
output: results.csv
output_dir: outputs
temperature: 0.7
tier: quick                    # smoke | quick | full | deep
max_workers: 5
collect_metrics: true

run multiple configs in one pass:

promptpressure --multi-config configs/a.yaml configs/b.yaml

project structure

promptpressure/
  adapters/           # model connectors (openrouter, groq, ollama, claude code, etc)
  plugins/            # scorer plugin system
  monitoring/         # prometheus metrics + docker-compose
  templates/          # jinja2 report templates (html, markdown)
  api.py              # fastapi server (optional, for programmatic access)
  cli.py              # main eval runner
  config.py           # pydantic settings
  tier.py             # tier filtering (smoke/quick/full/deep)
  per_turn_metrics.py # automated per-turn behavioral metrics
  database.py         # sqlalchemy models
  metrics.py          # metrics collector
  rate_limit.py       # async token bucket rate limiter
  reporting.py        # report generator
configs/              # yaml eval configs per model
evals_dataset.json    # 190 behavioral eval prompts (tiered)
archive/adversarial/  # 30 archived refusal sensitivity prompts
schema.json           # JSON Schema for dataset entry format
results/              # saved eval results (per-model JSON)
examples/             # sample reports and comparison data
tests/                # pytest suite (50 tests)

sample report

see examples/sample_report.html for what the output looks like.

security

API keys loaded from .env (gitignored), never persisted to database
API server binds to 127.0.0.1 by default
CORS restricted to localhost (override with --cors-origins)
bearer token auth on all API endpoints (set PROMPTPRESSURE_API_SECRET)
grading pipeline uses XML boundaries to prevent prompt injection
plugin install requires authentication
no telemetry

contributing

tests pass: pytest tests/
no unnecessary dependencies
document changes

license

MIT. see LICENSE.

Name		Name	Last commit message	Last commit date
Latest commit History 124 Commits
.codex/environments		.codex/environments
.github/workflows		.github/workflows
MacApp		MacApp
archive/adversarial		archive/adversarial
configs		configs
corpus/drift-v0.1		corpus/drift-v0.1
docs		docs
examples		examples
outputs		outputs
outputs_grok420		outputs_grok420
outputs_mimo_omni		outputs_mimo_omni
promptpressure		promptpressure
reports		reports
results		results
script		script
scripts		scripts
tests		tests
.env.example		.env.example
.gitignore		.gitignore
ARCHITECTURE.md		ARCHITECTURE.md
CHANGELOG.md		CHANGELOG.md
CONTRIBUTING.md		CONTRIBUTING.md
LICENSE		LICENSE
MONITORING.md		MONITORING.md
Package.swift		Package.swift
README.md		README.md
TODOS.md		TODOS.md
VERSION		VERSION
evals_dataset.json		evals_dataset.json
evals_tone_sycophancy.json		evals_tone_sycophancy.json
litellm_config.yaml		litellm_config.yaml
pyproject.toml		pyproject.toml
registry.json		registry.json
requirements.txt		requirements.txt
roadmap.md		roadmap.md
run_eval.py		run_eval.py
run_promptpressure_cloud.bat		run_promptpressure_cloud.bat
run_promptpressure_dynamic.bat		run_promptpressure_dynamic.bat
schema.json		schema.json
server.py		server.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

PromptPressure

install

quick start in 60 seconds

launcher

macOS native app

what it evaluates

how it compares

run tiers

per-turn metrics

drift suite + judge calibration (v3.3)

archived adversarial suite

adapters

litellm proxy (recommended for multi-provider evals)

custom adapters

zero-cost adapters

batch mode

post-analysis (automated grading)

CI mode

CLI reference

configuration

project structure

sample report

security

contributing

license

About

Uh oh!

Releases 2

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

PromptPressure

install

quick start in 60 seconds

launcher

macOS native app

what it evaluates

how it compares

run tiers

per-turn metrics

drift suite + judge calibration (v3.3)

archived adversarial suite

adapters

litellm proxy (recommended for multi-provider evals)

custom adapters

zero-cost adapters

batch mode

post-analysis (automated grading)

CI mode

CLI reference

configuration

project structure

sample report

security

contributing

license

About

Resources

License

Contributing

Uh oh!

Stars

Watchers

Forks

Releases 2

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages