Skip to content

routine: value-reminder-tripwire (2026-05-24)#10

Draft
jim4226 wants to merge 1 commit into
mainfrom
claude/daily-2026-05-24-value-reminder-tripwire
Draft

routine: value-reminder-tripwire (2026-05-24)#10
jim4226 wants to merge 1 commit into
mainfrom
claude/daily-2026-05-24-value-reminder-tripwire

Conversation

@jim4226
Copy link
Copy Markdown
Owner

@jim4226 jim4226 commented May 24, 2026

Summary

Adds ValueReminderTool to csis/safety/ — a mid-task callable that returns an agent's core commitments as prompt-injectable text, backed by 10 regression tests.

Source

  • URL: https://www.anthropic.com/news/widening-conversation-ai (May 19, 2026)
  • Key finding: "researchers gave Claude access to a tool it could call mid-task that returned a brief reminder of its own ethical commitments … markedly lower rates of misaligned behavior on several internal alignment evaluations."

Theme

Theme 3 — Constitutional / safety primitives. The tool surfaces commitments at decision time, not just at session startup. CORE_COMMITMENTS is the positive-framing mirror of Constitution.DISALLOWED_PATTERNS — same invariants, grammatically complete sentences the LLM can reason with.

What changed

  • csis/safety/value_reminder.py (new, ~70 LOC): CORE_COMMITMENTS tuple + ValueReminderResult frozen dataclass + ValueReminderTool class with get(), formatted_reminder(), and call_count property.
  • tests/test_value_reminder.py (new, ~60 LOC): 10 tests covering happy path, call count, immutability, and load-bearing commitment assertions (shutdown/halt, tier ceiling).

No cycle-9 chokepoints touched

This is a purely additive new module. It does not touch Coordinator.__init__, _BackendTracker, writer_iteration_id, or the promotion CAS. Agents opt in by holding a ValueReminderTool instance and calling formatted_reminder(role) before ethically sensitive sub-tasks.

Test plan

python -m pytest tests/test_value_reminder.py -v   # 10 new tests
python -m pytest tests/ -q                          # full suite — 260 passed

Generated by Claude Code

Anthropic research (May 2026) showed agents with access to a mid-task
ethical commitment-recall tool have markedly lower misalignment rates on
internal alignment evals. The tool surfaces commitments at the moment of
decision, not just at session startup (when they're easy to forget).

CORE_COMMITMENTS mirrors the Constitution's DISALLOWED_PATTERNS in positive
framing — same invariants, grammatically complete so LLM context is useful.
Call count is tracked so the Auditor can flag agents with zero self-checks.
No cycle-9 chokepoints touched; this is additive only.

Theme 3: Constitutional / safety primitives.

https://claude.ai/code/session_01Pg4ZWBjonH243FLwB5brQ2
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants