Names the exact change that broke the agent and emits a markdown PR comment. Composes the parser, mining, the nine-axis differ, the policy checker, and the causal-attribution module into one PR-time surface.
Required flags:
- `--traces PATH ...`, baseline production-like `.agentlog` files (or directories). Repeatable.
- `--baseline-config FILE`, baseline agent config YAML (same schema as `shadow replay`).
- `--candidate-config FILE`, candidate agent config YAML.
- `--out PATH`, where to write the JSON report (`diagnose-pr/v0.1` schema).
Common optional flags:
- `--candidate-traces PATH ...`, candidate `.agentlog` files paired by filename to baseline (omit to skip per-trace diff).
- `--policy FILE`, behavior-policy YAML overlay.
- `--pr-comment FILE`, write the markdown PR comment.
- `--changed-files PATH ...`, files changed in the PR (e.g. from `git diff --name-only`); used to attach human-readable filenames to prompt deltas.
- `--baseline-ref REF`, git ref or SHA to diff prompt files against (typically `${{ github.event.pull_request.base.sha }}` in CI). When set with `--changed-files`, prompt deltas pick up file:line + removed-text blame so the PR comment cites `prompts/refund.md:17 removed: "Always confirm…"` instead of the flattened config path.
- `--max-traces N`, cap on traces fed to the per-trace loop. Mining selects representatives above this. Default 200.
- `--fail-on {none,probe,hold,stop}`, exit non-zero on this verdict floor. Default `none`.
- `--n-bootstrap N`, bootstrap resample count for causal CIs. Used only when `--backend != recorded`. Default 500.
- `--backend {recorded,mock,live}`, cause-attribution backend:
  - `recorded` (default): simple attribution from observed deltas; no CI; one-delta runs still get conf=1.0.
  - `mock`: deterministic per-delta intervention for tests/demos; real bootstrap CIs but synthetic ATE magnitudes (PR comment surfaces the disclosure).
  - `live`: real OpenAI replay per baseline trace; divergence is the corpus mean.
- `--max-cost USD`, hard ceiling on total spend during `--backend live` runs. Aborts before exceeding the cap; ignored otherwise.
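
For intuition about what `--n-bootstrap` controls, the following is a minimal sketch of a percentile bootstrap confidence interval over per-trace divergence scores. It is not Shadow's implementation: the function name `bootstrap_ci`, the sample scores, and the choice of the percentile method are assumptions made for illustration only.

```python
import random
import statistics


def bootstrap_ci(values, n_bootstrap=500, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of `values` (illustrative only)."""
    if not values:
        raise ValueError("need at least one value")
    rng = random.Random(seed)  # seeded so repeated runs give the same interval
    means = []
    for _ in range(n_bootstrap):
        # Resample with replacement, same size as the original sample.
        sample = [rng.choice(values) for _ in values]
        means.append(statistics.fmean(sample))
    means.sort()
    lo_idx = int((alpha / 2) * n_bootstrap)
    hi_idx = min(n_bootstrap - 1, int((1 - alpha / 2) * n_bootstrap))
    return means[lo_idx], means[hi_idx]


# Hypothetical per-trace divergence scores for a single config delta.
scores = [0.10, 0.12, 0.08, 0.40, 0.11, 0.09, 0.35, 0.10]
low, high = bootstrap_ci(scores, n_bootstrap=500)
print(f"mean={statistics.fmean(scores):.3f} ci=({low:.3f}, {high:.3f})")
```

A larger resample count generally gives a more stable interval at the cost of more compute, which is presumably why the flag only matters for the backends that produce CIs.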
The verdict is one of ship / probe / hold / stop. The
affected_trace_ids field of the JSON report drives shadow verify-fix. See Causal PR diagnosis
for the underlying flow.
Closes the diagnose → fix → verify loop. Reads a previous
diagnose-pr report, re-diffs only the affected traces against a
candidate-with-patch, and reports pass/fail with explicit
fail-reasons. Exit 0 pass, 1 fail.
Required flags:
- `--report FILE`, path to the diagnose-pr `report.json`.
- `--traces PATH ...`, original baseline `.agentlog` files.
- `--fixed-traces PATH ...`, candidate-with-patch `.agentlog` files, paired by filename to baseline.
- `--out PATH`, where to write the verify-fix JSON report.
Optional flags:
- `--policy FILE`, behavior-policy YAML overlay (same schema as diagnose-pr).
- `--affected-threshold FLOAT`, minimum fraction of affected traces that must be reversed. Default 0.90.
- `--safe-ceiling FLOAT`, maximum fraction of previously-safe traces that may regress. Default 0.02.
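
The two thresholds combine into a single pass/fail decision. The sketch below shows one plausible reading of that rule; the function name `verify_fix`, its count-based arguments, the wording of the fail-reasons, and the treatment of empty sets are assumptions, not Shadow's actual code.

```python
def verify_fix(affected_reversed, affected_total, safe_regressed, safe_total,
               affected_threshold=0.90, safe_ceiling=0.02):
    """Return (passed, fail_reasons) under the two documented thresholds."""
    # Assumption: an empty set is treated as vacuously fine.
    reversed_frac = affected_reversed / affected_total if affected_total else 1.0
    regressed_frac = safe_regressed / safe_total if safe_total else 0.0

    reasons = []
    if reversed_frac < affected_threshold:
        reasons.append(
            f"only {reversed_frac:.0%} of affected traces reversed "
            f"(need >= {affected_threshold:.0%})"
        )
    if regressed_frac > safe_ceiling:
        reasons.append(
            f"{regressed_frac:.0%} of previously-safe traces regressed "
            f"(ceiling {safe_ceiling:.0%})"
        )
    return (not reasons, reasons)


passed, reasons = verify_fix(affected_reversed=8, affected_total=10,
                             safe_regressed=0, safe_total=100)
exit_code = 0 if passed else 1  # documented: exit 0 pass, 1 fail
print(exit_code, reasons)
```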
CI-friendly wrapper around shadow diagnose-pr with verdict-mapped
exit codes:
| Verdict | Exit | Meaning |
|---|---|---|
| `ship` | 0 | No behavior regression; merge clear. |
| `probe` / `hold` | 1 | Regression detected; merge held. |
| `stop` | 2 | Critical violation; do not merge. |
| (internal/tooling error) | 3 | Treat as failure in CI. |
Always writes the JSON report (to a tempfile if --out omitted) and
the PR comment markdown so a follow-up CI step can post it. Flags
mirror diagnose-pr (--traces, --candidate-traces,
--baseline-config, --candidate-config, --policy,
--changed-files, --max-traces, --backend, --n-bootstrap,
--max-cost, --pr-comment, --out); see above for full
descriptions.
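
A follow-up CI step that needs to branch on the outcome can mirror the table above. This is a small sketch of the mapping; the names `VERDICT_EXIT_CODES` and `exit_code_for` are hypothetical, and mapping an unrecognized verdict to exit code 3 is an assumption.

```python
# Exit codes copied from the verdict table.
VERDICT_EXIT_CODES = {"ship": 0, "probe": 1, "hold": 1, "stop": 2}
TOOLING_ERROR_EXIT = 3


def exit_code_for(verdict):
    """Map a verdict string to the documented exit code."""
    # Assumption: anything that is not a known verdict counts as a tooling error.
    return VERDICT_EXIT_CODES.get(verdict, TOOLING_ERROR_EXIT)


def merge_allowed(exit_code):
    """Only a clean `ship` (exit 0) clears the merge."""
    return exit_code == 0


for verdict in ("ship", "probe", "hold", "stop", "garbled"):
    code = exit_code_for(verdict)
    print(verdict, code, merge_allowed(code))
```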
Serve a diagnose-pr report.json as a browsable HTML page —
verdict, top causes, and per-trace diagnoses. Single-process,
single-report, no auth. Meant for local review of CI artefacts or
sharing with someone who can't run the CLI.
Flags:
- `--report FILE` (required), the report to render.
- `--port N`, default 8080.
- `--host HOST`, default `127.0.0.1`. Pass `0.0.0.0` to expose on the network, but don't expose to the public internet without a reverse proxy doing TLS + auth.
- `--open`, launch the URL in the default browser on startup.
Routes: / (HTML), /report.json (raw JSON), /healthz (liveness
probe). User-controlled fields are HTML-escaped. Requires the
serve extra: pip install 'shadow-diff[serve]'.
Run a nine-axis diff against bundled fixtures. Single command, no API key, no files written. The fastest "does Shadow work on my box?" check.
Scaffold a working Shadow scenario in PATH (default
shadow-quickstart). No API keys required. See
Install and first diff.
One-screen terminal view of a trace — turn / role / tokens / latency
/ cost / redactions / first divergence. The daily-debug surface;
what pytest -v is to test runs, this is to recorded agent traces.
Flags:
- `--full`, expand truncated summaries to the full text.
- Two positional args = comparison mode: the first divergent turn is highlighted in red.
Scan committed .agentlog files for credentials / PII / custom
patterns. Exits non-zero on any hit, so it composes into CI before
shadow gate-pr.
Flags:
- `--patterns FILE`, load extra `name=regex` rules per line (`#` comments allowed; the loader names the offending line on syntax errors).
- `--only NAMES`, restrict to a comma-separated subset of pattern names (e.g. `--only email,credit_card`).
- `--redact-snippets`, replace the literal match with the pattern marker in stdout/stderr; safe for CI logs without leaking the credential to the build's log archive.
- `--json`, emit a single JSON object on stdout for piping into `jq` or annotating PR comments.
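
The `--patterns` file format (`name=regex` per line, `#` comments, errors that name the offending line) can be sketched as a small parser. This is an illustration under assumptions, not the real loader: the function name `load_patterns`, the handling of blank lines, whole-line-only comments, and splitting on the first `=` are all guesses, and the example rule names are hypothetical.

```python
import re


def load_patterns(text):
    """Parse `name=regex` lines into a dict of compiled patterns."""
    patterns = {}
    for lineno, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # assumption: blank lines and whole-line comments are skipped
        # Split on the first "=" only, so the regex itself may contain "=".
        name, sep, regex = line.partition("=")
        name, regex = name.strip(), regex.strip()
        if not sep or not name or not regex:
            raise ValueError(f"line {lineno}: expected name=regex, got {raw!r}")
        try:
            patterns[name] = re.compile(regex)
        except re.error as exc:
            raise ValueError(
                f"line {lineno}: invalid regex for {name!r}: {exc}"
            ) from exc
    return patterns


# Hypothetical custom rules.
EXAMPLE = """\
# internal identifiers
ticket_id=TICKET-[0-9]{4,}
employee_badge=EMP[0-9]{6}
"""

patterns = load_patterns(EXAMPLE)
print(sorted(patterns))
```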
Built-in patterns: private_key, jwt, anthropic_api_key,
openai_api_key, aws_access_key_id, github_token, email,
phone, credit_card (Luhn-validated). Same set used by the
Redactor so the scanner is the second line of defence behind the
SDK's write-time redaction.
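
"Luhn-validated" means a digit run only counts as a `credit_card` hit if it passes the Luhn checksum, which filters out most random 16-digit numbers. A minimal sketch of the standard algorithm follows; the function name `luhn_valid` is hypothetical and this is not the scanner's own code.

```python
def luhn_valid(number):
    """Return True if the ASCII digits in `number` pass the Luhn checksum."""
    digits = [int(ch) for ch in number if "0" <= ch <= "9"]
    if len(digits) < 2:
        return False
    total = 0
    # Walk right to left, doubling every second digit.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9  # same as summing the two digits of the product
        total += d
    return total % 10 == 0


print(luhn_valid("4111 1111 1111 1111"))
print(luhn_valid("4111 1111 1111 1112"))
```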
Frozen-baseline workflow modeled on Insta + Jest snapshots — the
hash of the baseline trace corpus is pinned into shadow.yaml, and
PRs that change the baseline show up in git diff as a single line.
| Subcommand | Effect |
|---|---|
| `baseline create <dir>` | Hash the directory and pin it into `shadow.yaml`. Errors if a pin already exists. |
| `baseline update --force` | Re-pin the existing baseline directory after a deliberate regeneration. `--force` is the friction flag: because it is required, a typo can't silently approve a regression. |
| `baseline approve <candidate-dir> --force` | Copy `.agentlog` files from the candidate directory into the baseline directory and re-pin. The end state: this trace set is now the gold standard. |
| `baseline verify` | Re-hash the baseline directory and compare against the pin. Exits non-zero on drift; the message names both digests so a reviewer can decide quickly whether the change is intentional. |
The hash is invariant under filename rename — same record bytes, different filenames, same digest. Mutating any record flips the digest.
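
One way to get a digest with this property is to hash each file's bytes, sort the per-file hashes, and hash the sorted list, so filenames never enter the computation. The sketch below demonstrates the idea only; SHA-256, per-file hashing, and the name `corpus_digest` are assumptions and may differ from how Shadow actually computes the pin.

```python
import hashlib
import tempfile
from pathlib import Path


def corpus_digest(directory):
    """Digest over file contents only; filenames do not affect the result."""
    file_hashes = sorted(
        hashlib.sha256(p.read_bytes()).hexdigest()
        for p in Path(directory).glob("*.agentlog")
        if p.is_file()
    )
    outer = hashlib.sha256()
    for h in file_hashes:
        outer.update(h.encode("ascii"))
    return outer.hexdigest()


with tempfile.TemporaryDirectory() as a, tempfile.TemporaryDirectory() as b:
    Path(a, "one.agentlog").write_bytes(b'{"turn": 1}\n')
    Path(b, "renamed.agentlog").write_bytes(b'{"turn": 1}\n')
    same_after_rename = corpus_digest(a) == corpus_digest(b)

    # Mutating a record changes the digest.
    Path(b, "renamed.agentlog").write_bytes(b'{"turn": 2}\n')
    changed_after_edit = corpus_digest(a) != corpus_digest(b)

print(same_after_rename, changed_after_edit)
```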
Scaffold .shadow/ and shadow.yaml in PATH. Auto-detects the
project type (Python / Node / Rust). With --github-action, also
drops a CI workflow:
- default: `.github/workflows/shadow-diagnose-pr.yml` (runs `shadow diagnose-pr` on every PR; the recommended path)
- `--legacy-diff`: `.github/workflows/shadow-diff.yml` (runs the older `shadow diff` flow without causal attribution)
Path-traversal hardened — refuses system directories (/etc,
/usr, etc.).
Run <cmd> with zero-config auto-instrumentation. Writes to
-o path.agentlog. Flags:
- `--tags KEY=V,K=V`, metadata tags
- `--no-auto-instrument`, skip the sitecustomize shim
- Fail-fast writability preflight on the output path
Replay <trace> through <config> via --backend {mock,positional}.
mock returns the baseline response verbatim; positional uses a
recorded reference trace (--reference <path>) and replays the
candidate against it. Live LLM backends (anthropic / openai) live
on the diff path through --judge-backend, not on replay.
Lock a baseline prefix verbatim and replay only the suffix through the backend:
- `--partial`, enable partial-replay mode
- `--branch-at N`, 0-based turn index where live replay begins (`0` = fully live, same as default; `>= len(turns)` = full-baseline copy)
Useful for "what would have happened from turn 3 onward if the model had stayed on the baseline path through turn 2?" experiments.
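
The `--branch-at` semantics amount to slicing the baseline turns at index N. A minimal sketch, assuming turns can be treated as a plain list; the function name `split_for_partial_replay` and the string turn labels are hypothetical.

```python
def split_for_partial_replay(baseline_turns, branch_at):
    """Split turns into (locked_prefix, replay_suffix) at a 0-based index."""
    if branch_at < 0:
        raise ValueError("branch_at must be >= 0")
    locked_prefix = list(baseline_turns[:branch_at])   # copied verbatim
    replay_suffix = list(baseline_turns[branch_at:])   # sent through the backend
    return locked_prefix, replay_suffix


turns = ["turn0", "turn1", "turn2", "turn3", "turn4"]

# Baseline path held through turn 2, replay from turn 3 onward.
locked, replayed = split_for_partial_replay(turns, 3)
print(locked, replayed)
```

With `branch_at=0` the prefix is empty (fully live), and with `branch_at >= len(turns)` the suffix is empty (full-baseline copy), matching the two edge cases in the flag description.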
Nine-axis behavior diff. Key flags:
- `--judge {none,auto,sanity,pairwise,llm,procedure,schema,factuality,refusal,tone}`
- `--judge-config <file.yaml>` for rubric-based judges
- `--judge-backend {mock,anthropic,openai}` for live judges
- `--explain` for LLM-sourced paragraph summary
- `--hierarchical` for session-level breakdown
- `--pricing <file.json>` for cost attribution
- `--output-json <file>` to write the full report
- `--token-diff`, per-dimension token distribution (input / output / thinking) with median + p25 + p75 + p95 + max + total; plus the top-k worst per-pair deltas. See Hierarchical diff, token-level.
- `--policy path/to/rules.yaml`, check a declarative YAML policy overlay against both traces and classify rule violations as regressions vs fixes. Supports 12 rule kinds: `must_call_before`, `must_call_once`, `no_call`, `max_turns`, `required_stop_reason`, `max_total_tokens`, `must_include_text`, `forbidden_text`, `must_match_json_schema`, `must_remain_consistent`, `must_followup`, `must_be_grounded`. Each rule can carry a `when:` clause that gates it on field-path conditions (operators: `==`, `!=`, `>`, `>=`, `<`, `<=`, `in`, `not_in`, `contains`, `not_contains`). See Behavior policy.
- `--fail-on {minor,moderate,severe}`, exit non-zero when the worst axis severity or policy regression hits the threshold. Default is `never` (post the report, exit 0). Use `--fail-on severe` to gate a PR merge on agent regressions.
- `--suggest-fixes`, layer an LLM pass on top of the deterministic recommendation engine to produce concrete code-level fix proposals. Each suggestion is grounded on a deterministic anchor (ungrounded model output is rejected). Requires a live backend (`--judge-backend anthropic|openai` or `--judge auto` with the corresponding env var set). Retry/backoff on 429/5xx/timeout.
- `--harness-diff`, render a per-(category, name) diff over `harness_event` records (retry, rate_limit, model_switch, context_trim, cache, guardrail, budget, stream_interrupt, tool_lifecycle). Regressions appear before fixes, ordered by severity then absolute count delta.
- `--multimodal-diff`, render a per-blob diff over `blob_ref` records. Cheap tier uses 64-bit dHash Hamming distance; semantic tier uses cosine similarity over recorded embeddings when both sides have them. Identical `blob_id` short-circuits.
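
The two `--multimodal-diff` tiers rest on standard measures. The sketch below shows both in plain Python: Hamming distance between two 64-bit hashes and cosine similarity between two embedding vectors. The function names and the sample values are hypothetical, and how Shadow turns these numbers into a severity is not shown here.

```python
import math


def hamming_distance_64(hash_a, hash_b):
    """Number of differing bits between two 64-bit integer hashes."""
    mask = (1 << 64) - 1
    return bin((hash_a ^ hash_b) & mask).count("1")


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Hypothetical dHash values that differ in exactly two bits.
baseline_hash = 0xF0F0F0F0F0F0F0F0
candidate_hash = baseline_hash ^ 0b101
print(hamming_distance_64(baseline_hash, candidate_hash))

# Hypothetical embeddings pointing in the same direction.
print(cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))
```

A small Hamming distance suggests near-identical images at very low cost, which is presumably why it serves as the cheap tier before any embedding comparison.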
Apply --fail-on to a saved report.json (produced by shadow diff --output-json) without re-running the diff. Designed for CI flows
that already produced the report for the PR comment and want to
gate the merge as a separate, cheap step:

    shadow diff base.agentlog cand.agentlog --output-json report.json
    shadow gate report.json --fail-on severe

With --policy <yaml>, the gate also recomputes policy regressions
from the original traces (passed via --baseline / --candidate)
and counts them toward the threshold. Without --policy, it gates
purely on axis severity and is fast.
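
Gating on axis severity is an ordered-threshold comparison. The following is a sketch of that comparison only; the function name `should_fail`, the `none` level for a clean report, and passing the worst severity as a plain string are assumptions, since the `report.json` layout is not described here.

```python
# Ordered from least to most severe. "none" (no regression) is an assumed level.
SEVERITY_ORDER = ["none", "minor", "moderate", "severe"]


def should_fail(worst_severity, fail_on):
    """True when the worst observed severity reaches the --fail-on threshold."""
    if fail_on == "never":
        return False  # documented default for shadow diff: report only, exit 0
    return SEVERITY_ORDER.index(worst_severity) >= SEVERITY_ORDER.index(fail_on)


for worst in SEVERITY_ORDER:
    print(worst, should_fail(worst, "moderate"))
```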
LASSO-over-corners causal attribution. --backend anthropic|openai
enables live-replay mode; default (none) uses the heuristic
allocator. --candidate-traces <trace> supplies a candidate trace
when the backend is none.
Tool-schema change detection. --format {terminal,markdown,json}.
--fail-on {breaking,risky,additive,neutral,none}.
Re-render a saved JSON report. --format {terminal,markdown,github-pr}.
Import foreign traces to .agentlog. Supported formats (v1.2):
- `langfuse`, Langfuse traces export
- `braintrust`, Braintrust experiment row export (JSONL or array)
- `langsmith`, LangSmith runs export (top-level array)
- `openai-evals`, OpenAI Evals JSONL
- `otel`, OpenTelemetry OTLP/JSON with GenAI semconv attributes
- `mcp`, Model Context Protocol session log (JSONL, JSON array, or wrapped `{messages: [...]}`)
- `vercel-ai` (new in v1.2), Vercel AI SDK telemetry export (OTLP-style `{spans: [...]}` or dashboard-style `{events: [...]}`)
- `pydantic-ai` (new in v1.2), PydanticAI `all_messages_json()` output or Logfire span export
Export to otel (OTLP/JSON) for OpenTelemetry collectors.
Merge multiple .agentlog files into one logical trace via
meta.trace_id.
Cluster a corpus of production traces by tool sequence, stop reason,
response length, and latency, then surface representative cases as a
regression suite. Output is a list of (trace_id, cluster_id, why)
triples that you can commit alongside the agent as your golden test
set.
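
To make the output shape concrete, here is a deliberately simplified sketch of mining. It groups traces by exact tool sequence and stop reason, then picks the median-latency member of each group as the representative. The real command also considers response length and presumably uses a less rigid notion of similarity; the function name `mine_representatives` and the trace dictionary fields (`tools`, `stop_reason`, `latency_ms`) are hypothetical.

```python
from collections import defaultdict


def mine_representatives(traces):
    """Return (trace_id, cluster_id, why) triples, one per cluster."""
    clusters = defaultdict(list)
    for trace in traces:
        key = (tuple(trace["tools"]), trace["stop_reason"])
        clusters[key].append(trace)

    triples = []
    # Sort by key so cluster ids are stable across runs.
    for cluster_id, (key, members) in enumerate(sorted(clusters.items())):
        members.sort(key=lambda t: t["latency_ms"])
        representative = members[len(members) // 2]  # median-latency member
        tools, stop_reason = key
        why = f"tools={'>'.join(tools) or '(none)'} stop={stop_reason} n={len(members)}"
        triples.append((representative["trace_id"], cluster_id, why))
    return triples


corpus = [
    {"trace_id": "t1", "tools": ["search", "refund"], "stop_reason": "end_turn", "latency_ms": 900},
    {"trace_id": "t2", "tools": ["search", "refund"], "stop_reason": "end_turn", "latency_ms": 1200},
    {"trace_id": "t3", "tools": ["search", "refund"], "stop_reason": "end_turn", "latency_ms": 4000},
    {"trace_id": "t4", "tools": [], "stop_reason": "max_tokens", "latency_ms": 300},
]
triples = mine_representatives(corpus)
for triple in triples:
    print(triple)
```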
Run Shadow as a Model Context Protocol server over stdio. Any MCP-aware agentic CLI (Claude Code, Cursor, Zed, Claude Desktop, Windsurf) can invoke Shadow as a tool. Tools exposed:
- `shadow_diff`
- `shadow_check_policy`
- `shadow_token_diff`
- `shadow_schema_watch`
- `shadow_summarise`
- `shadow_certify` (v1.7.2+)
- `shadow_verify_cert` (v1.7.2+)
Install the extra first: pip install 'shadow-diff[mcp]'. See
MCP importer for the reverse direction
(importing MCP traces into Shadow).
Generate an Agent Behavior Certificate (ABOM) for a release. The certificate is a content-addressed JSON release artefact capturing the trace's content-id, all distinct models, content-ids of system prompts, content-ids of tool schemas, optional policy hash, and an optional baseline-vs-candidate nine-axis regression-suite rollup.
Required: --agent-id <id> and --output <path>. Optional:
--policy <file> (records its hash), --baseline <trace> (folds
in a regression-suite rollup), --pricing <file> (for the
regression-suite cost axis), --seed <int>.
The certificate is self-verifying via shadow verify-cert. See
Release certificate.
Verify a certificate's content-addressed cert_id matches the
body. Exits 0 when consistent, 1 on tamper, malformed payload, or
unsupported cert_version. Designed to run as a release-pipeline
gate.
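
Content addressing means the id is derived from the body, so any edit to the body invalidates the id. The sketch below illustrates the check in general terms. Only the `cert_id` and `cert_version` field names come from this page; the canonical-JSON encoding, SHA-256, the `sha256:` prefix, the `agent_id` key, and the sample values are assumptions, so this will not necessarily match real certificates.

```python
import hashlib
import json


def compute_cert_id(body):
    """Hash a canonical JSON encoding of the certificate body."""
    canonical = json.dumps(body, sort_keys=True, separators=(",", ":"))
    return "sha256:" + hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def verify_cert(cert):
    """True when cert_id matches the hash of everything except cert_id."""
    body = {k: v for k, v in cert.items() if k != "cert_id"}
    return cert.get("cert_id") == compute_cert_id(body)


# Hypothetical certificate body.
body = {"cert_version": "0.1", "agent_id": "refund-bot"}
cert = dict(body, cert_id=compute_cert_id(body))
tampered = dict(cert, agent_id="other-bot")

print(verify_cert(cert), verify_cert(tampered))
```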
Start the live diff dashboard (requires the serve extra:
pip install 'shadow-diff[serve]').
Prints the installed Shadow version + .agentlog spec version.