docs: add explicit Threat Model section to README #9

Merged
waitdeadai merged 1 commit into main from docs/threat-model
May 16, 2026

@waitdeadai
Owner

Summary

  • Adds a six-item Threat Model section to the README between "Not a jailbreak" and "Parent harness".
  • Mirrors the companion paper's threat model (agent-closeout-bench paper/uai-2026/main.tex §Threat Model and Limitations).
  • Cross-references the companion benchmark as the surface where lexical brittleness is measured rather than hidden.

Why now

Reviewer-grade limitations were previously documented in PR descriptions and in the companion benchmark repo, but never surfaced at the top of this README. Operators considering this suite for safety-critical workflows should see them before relying on it.

Threat model items added

  1. Lexical evasion
  2. Hook misconfiguration
  3. Runtime bypass
  4. In-band manipulation
  5. Evidence-marker limitations
  6. Coverage and language scope (English-only, Claude Code Stop/SubagentStop)

Test plan

  • README renders correctly on GitHub markdown preview.
  • No broken links to agent-closeout-bench.
  • Wording is conservative: no claim of prompt-injection immunity and no claim of universal coverage.
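
The "no broken links" item above could be partly checked offline with a sketch like the following. It is not part of this PR or of the suite: the function names, the sample text, and the `example.com` URL are all hypothetical placeholders. It only inspects inline Markdown links for obviously malformed targets and does not make network requests, so the rendered README should still be checked by hand.

```python
import re
from urllib.parse import urlparse

# Matches inline Markdown links of the form [text](target); the target may be empty.
LINK_RE = re.compile(r"\[([^\]]*)\]\(([^)\s]*)\)")


def benchmark_links(markdown, needle="agent-closeout-bench"):
    """Return (text, target) pairs for inline links whose text or target mentions needle."""
    return [(text, target)
            for text, target in LINK_RE.findall(markdown)
            if needle in text or needle in target]


def suspicious(links):
    """Return links whose target is not an absolute http(s) URL.

    Relative paths are flagged too, since they need manual review.
    """
    flagged = []
    for text, target in links:
        parsed = urlparse(target)
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            flagged.append((text, target))
    return flagged


# Hypothetical README excerpt; the URL is a placeholder, not the real repository address.
SAMPLE = """\
See [agent-closeout-bench](https://example.com/agent-closeout-bench) for measurements.
Broken: [agent-closeout-bench]() and [paper](agent-closeout-bench/paper).
"""

links = benchmark_links(SAMPLE)
for text, target in suspicious(links):
    print(f"review link: [{text}]({target})")
```

In practice the sample string would be replaced with the README contents, for example via `pathlib.Path("README.md").read_text()`.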

🤖 Generated with Claude Code

Six enumerated failure modes mirroring the companion paper
(agent-closeout-bench paper/uai-2026/main.tex §Threat Model and
Limitations):

1. Lexical evasion
2. Hook misconfiguration
3. Runtime bypass
4. In-band manipulation
5. Evidence-marker limitations
6. Coverage and language scope

Reviewer-grade limitations were previously documented in PR
descriptions and the companion benchmark; this surfaces them at the
top of the README so operators considering the suite for
safety-critical workflows see them before relying on it.

Cross-references the companion paper's benchmark
(agent-closeout-bench) as the surface where lexical brittleness is
measured rather than hidden.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@waitdeadai merged commit 3381845 into main on May 16, 2026
2 checks passed
@waitdeadai deleted the docs/threat-model branch on May 16, 2026, 16:38