Reference implementation of the trust-tier pattern from the Ryn Orchestrator Reference Design.
The pattern is implemented via OpenClaw plugin hooks:
- Trust-tier tagging: every inbound message is tagged `principal`/`friend`/`interloper` based on platform-stable identity (Discord snowflakes). The tier is written to an in-process sender cache keyed by `sessionKey` so downstream hooks can read it.
- Memory recall: per-interlocutor memory is auto-injected into the system prompt via a `registerMemoryPromptSupplement` callback. (The originally planned `<untrusted>`-tag entity-encoder ran on `before_agent_start`, which does not fire reliably for workspace plugins; the equivalent untrusted-content discipline is enforced via gate `rules.md` instructions and the safety backstop's outbound check for raw `<untrusted>` tags.)
- Inline Gate evaluation: non-principal drafts are evaluated by a Haiku judge before send. The verdict surface is binary: `approve` ships the draft, `deflect` substitutes a tier-appropriate template from `deflection_templates.md`. (The `revise` and `reject` verdicts from earlier design iterations are retired; see trust-gate-plugin.md §3 in the spec repo.)
- Safety backstop: final filter on outbound messages for architecture leaks and raw untrusted-tag escape, registered ahead of the Gate so a cancel short-circuits the Haiku call.
- Tool gating: only the principal can invoke tools; friend and interloper tiers are conversation-only. Resolution chain: sender-cache → `channel_tiers` → cron-sessionKey-shape grant → deny.
- Turn logging: every turn is logged to per-interlocutor `turns.ndjson` for lossless recovery.
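The tool-gating resolution chain can be sketched as follows. This is a minimal illustration, not the plugin's actual code: the names `mayInvokeTools`, `ToolGateContext`, and `ToolGateDeps` are hypothetical, the `Map` shape for `channel_tiers` is assumed, and the cron session-key check is left as a caller-supplied predicate because the real key shape is not documented here. It also assumes the first step that resolves a tier decides the outcome.

```typescript
type Tier = "principal" | "friend" | "interloper";

// Hypothetical context for one tool call.
interface ToolGateContext {
  sessionKey: string;
  channelId?: string;
}

// Hypothetical dependencies; shapes are assumptions for illustration.
interface ToolGateDeps {
  senderCache: Map<string, Tier>;              // sessionKey -> tier, written at inbound time
  channelTiers: Map<string, Tier>;             // assumed shape for channel_tiers
  isCronSessionKey: (key: string) => boolean;  // placeholder for the cron key-shape check
}

function mayInvokeTools(ctx: ToolGateContext, deps: ToolGateDeps): boolean {
  // 1. Sender cache: the tier tagged on the inbound message wins if present.
  const cached = deps.senderCache.get(ctx.sessionKey);
  if (cached !== undefined) return cached === "principal";

  // 2. channel_tiers: fall back to a per-channel tier.
  if (ctx.channelId !== undefined) {
    const channelTier = deps.channelTiers.get(ctx.channelId);
    if (channelTier !== undefined) return channelTier === "principal";
  }

  // 3. Cron grant: sessions whose key has the cron shape are allowed.
  if (deps.isCronSessionKey(ctx.sessionKey)) return true;

  // 4. Nothing resolved: deny.
  return false;
}
```

The final `return false` is the important part of the design: an unknown sender with no cache entry, no channel tier, and no cron-shaped key gets no tool access.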
Fail-closed semantics on inbound_claim and message_sending are enforced at the OpenClaw gateway — see Gateway config below.
Published reference implementation, v0.1.0. See SECURITY.md for what's actually guaranteed and what isn't.
- Hobbyists building personal AI companions exposed to Discord or similar mixed-trust channels on OpenClaw.
- Researchers wanting a runnable companion to the trust-tier reference design.
- OpenClaw plugin authors looking for a reference of the hook composition pattern (`inbound_claim` (priority 100) → memory-supplement callback → two handlers on `message_sending` (safety-backstop first, then gate-evaluator) → `before_tool_call` → `agent_end`).
- Production / commercial deployments — this is a reference implementation, not a hardened product. The Gate's judge-model recall is unmeasured.
- Multi-tenant platforms — the design assumes a single operator (the principal).
This plugin is distributed as a repo, not an npm package. Clone it directly into your OpenClaw workspace's extensions folder:
```bash
git clone https://github.com/openclaw-contrib/trust-gate-plugin.git \
  path/to/workspace/.openclaw/extensions/trust-gate
```

Then enable it in `openclaw.json`:

```json
{
  "extensions": {
    "trust-gate": { "enabled": true }
  }
}
```

`npm install` is only needed if you plan to run the test suite (it installs vitest); it is not required to use the plugin.
Two of this plugin's hooks are advertised as fail-closed. That behavior is enforced at the OpenClaw gateway, not in the plugin — the plugin registers the hooks but does not self-enforce the policy. Add to your openclaw.json:
```json
{
  "failurePolicyByHook": {
    "inbound_claim": "fail_closed",
    "message_sending": "fail_closed"
  }
}
```

Without this, a thrown exception inside `inbound_claim` (tier tagging) or `message_sending` (safety backstop) will fail open: messages flow through untagged or unfiltered. Do not deploy without this config.
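To make the difference between the two failure modes concrete, here is a small sketch of how a gateway might apply a per-hook failure policy. It is an illustration under stated assumptions, not OpenClaw's implementation: `runHookWithPolicy`, `HookResult`, and the `"fail_open"` policy name are hypothetical; only the `"fail_closed"` value appears in the config above.

```typescript
// "fail_closed" comes from the config above; "fail_open" is an assumed name
// for the default behavior when no policy is set.
type FailurePolicy = "fail_open" | "fail_closed";

// Hypothetical hook result: cancel=true means the message is dropped.
interface HookResult {
  cancel: boolean;
}

function runHookWithPolicy(handler: () => HookResult, policy: FailurePolicy): HookResult {
  try {
    return handler();
  } catch {
    // The handler threw. Under fail_closed the message is cancelled;
    // under fail_open it continues as if the hook had not run.
    return { cancel: policy === "fail_closed" };
  }
}
```

With a fail-open policy, a crash in tier tagging lets the message continue without a tier; with fail-closed, the same crash stops the message.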
The plugin reads config via configSchema in openclaw.plugin.json. All paths are relative to the OpenClaw workspace.
| Key | Default | What it is |
|---|---|---|
| `identityPath` | `state/identity` | Directory holding `principal.snapshot.json` + `friends.snapshot.json`. |
| `memoryPath` | `memory/interlocutors` | Directory for per-interlocutor memory (one subdirectory per snowflake). |
| `gatePath` | `state/gate` | Directory for Gate rules, deflection templates, pre-check allowlist, and verdict log. |
| `recallBudgetTokens` | `4000` | Max tokens of recalled memory injected per message. |
| `gateTimeoutMs` | `60000` | Self-managed timeout on the inline Haiku Gate call. Default raised from 10s in v0.1.0: long-prompt evaluations under load were timing out at 10s, returning a transient error string that the parser correctly flagged as not-a-verdict, which triggered deflect-on-everything. 30s minimum, 60s recommended. |
| `gateModel` | `claude-haiku-4-5` | Model used for Gate evaluation. |
| `consecutiveFailureThreshold` | `3` | Consecutive Gate API failures before an error log is emitted. (The notify-to-principal path is planned; v0.1.0 logs only.) |
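For readers who want the configuration as a typed object, the keys and defaults in the table can be mirrored as below. This is a sketch only: `TrustGateConfig`, `DEFAULT_CONFIG`, `resolveConfig`, and `configWarnings` are hypothetical names, and whether the plugin itself checks the 30s minimum is not stated, so the check here only produces a warning string.

```typescript
interface TrustGateConfig {
  identityPath: string;
  memoryPath: string;
  gatePath: string;
  recallBudgetTokens: number;
  gateTimeoutMs: number;
  gateModel: string;
  consecutiveFailureThreshold: number;
}

// Defaults copied from the configuration table.
const DEFAULT_CONFIG: TrustGateConfig = {
  identityPath: "state/identity",
  memoryPath: "memory/interlocutors",
  gatePath: "state/gate",
  recallBudgetTokens: 4000,
  gateTimeoutMs: 60000,
  gateModel: "claude-haiku-4-5",
  consecutiveFailureThreshold: 3,
};

// Overlay user-supplied keys on top of the defaults.
function resolveConfig(overrides: Partial<TrustGateConfig> = {}): TrustGateConfig {
  return { ...DEFAULT_CONFIG, ...overrides };
}

// Flag values that contradict the documented guidance (30s minimum timeout).
function configWarnings(config: TrustGateConfig): string[] {
  const warnings: string[] = [];
  if (config.gateTimeoutMs < 30000) {
    warnings.push(`gateTimeoutMs=${config.gateTimeoutMs} is below the documented 30s minimum`);
  }
  return warnings;
}
```

For example, `configWarnings(resolveConfig({ gateTimeoutMs: 10000 }))` returns one warning, which matches the 10s setting that caused deflect-on-everything before the default was raised.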
The Gate evaluator treats two sender IDs as principal-equivalent by default: openclaw-control-ui and webchat. These are local/authenticated OpenClaw channels that never carry a Discord snowflake. If you expose your OpenClaw webchat to the public internet, this bypass becomes a principal escalation — see SECURITY.md §"Known residual risks". The list is currently hardcoded in src/gate-evaluator.ts; a config-driven override is planned.
The plugin resolves trust tiers from two JSON files in identityPath. Schemas are in schemas/; stub examples in examples/.
`principal.snapshot.json`: single principal (you).

```json
{
  "version": 1,
  "principal_discord_id": "YOUR_DISCORD_SNOWFLAKE",
  "principal_discord_username_hint": "your_display_name",
  "alt_ids": []
}
```

`friends.snapshot.json`: active friend-tier interlocutors.

```json
{
  "version": 1,
  "friends": [
    { "discord_id": "FRIEND_SNOWFLAKE", "handle_hint": "display_name", "status": "active" }
  ]
}
```

Bump the `version` field on every update; the plugin reloads snapshots whenever the version increases.
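A minimal sketch of how a tier could be resolved from these two snapshots is shown below. It is not the plugin's code: `resolveTier` and `shouldReload` are hypothetical names, and two behaviors are assumptions made for illustration, namely that IDs in `alt_ids` count as the principal and that a friend entry whose `status` is not `"active"` falls back to `interloper`.

```typescript
type Tier = "principal" | "friend" | "interloper";

interface PrincipalSnapshot {
  version: number;
  principal_discord_id: string;
  principal_discord_username_hint?: string;
  alt_ids: string[];
}

interface FriendEntry {
  discord_id: string;
  handle_hint?: string;
  status: string;
}

interface FriendsSnapshot {
  version: number;
  friends: FriendEntry[];
}

function resolveTier(
  senderId: string,
  principal: PrincipalSnapshot,
  friends: FriendsSnapshot,
): Tier {
  // Assumption: alt_ids are additional snowflakes belonging to the principal.
  if (senderId === principal.principal_discord_id || principal.alt_ids.includes(senderId)) {
    return "principal";
  }
  // Assumption: only entries with status "active" grant the friend tier.
  const isActiveFriend = friends.friends.some(
    (f) => f.discord_id === senderId && f.status === "active",
  );
  // Anyone not recognized is an interloper.
  return isActiveFriend ? "friend" : "interloper";
}

// Reload only when the on-disk version is strictly greater than the loaded one.
function shouldReload(loadedVersion: number, onDiskVersion: number): boolean {
  return onDiskVersion > loadedVersion;
}
```

Matching is done on the snowflake only; the `*_hint` fields are display aids and play no part in the decision.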
The Gate reads plain-text rules from gatePath/rules.md each turn (cached, auto-reloaded on mtime change). Write your rules as a system prompt for a Haiku judge that emits one of two verdicts on its own line: APPROVE (ship the draft) or DEFLECT: <brief reason> (replace with a template). The legacy REJECT keyword is also accepted for compatibility but treated as DEFLECT. The parser tolerates leading markdown emphasis (**APPROVE**, > APPROVE), and on ambiguous responses logs the raw output at WARN level for tuning. Keep rules deployment-private — the published adversarial-test pattern expects the rules themselves to be a trust boundary.
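The verdict format described above can be parsed along these lines. This is a sketch under assumptions, not the code in `src/gate-evaluator.ts`: the `Verdict` type and `parseVerdict` are hypothetical, keywords are matched case-sensitively, and a response containing both an approve line and a deflect line is treated as ambiguous (a conservative choice made for this example).

```typescript
type Verdict =
  | { kind: "approve" }
  | { kind: "deflect"; reason: string }
  | { kind: "ambiguous"; raw: string };

function parseVerdict(raw: string): Verdict {
  let found: Verdict | undefined;
  for (const line of raw.split(/\r?\n/)) {
    // Strip leading blockquote/emphasis markers and trailing emphasis markers,
    // so "**APPROVE**" and "> APPROVE" both reduce to "APPROVE".
    const cleaned = line.replace(/^[\s>*_]+/, "").replace(/[\s*_]+$/, "");

    let current: Verdict | undefined;
    if (cleaned === "APPROVE") {
      current = { kind: "approve" };
    } else {
      // DEFLECT, or the legacy REJECT keyword, optionally followed by a reason.
      const m = /^(?:DEFLECT|REJECT)\b[:\s*_]*(.*)$/.exec(cleaned);
      if (m) current = { kind: "deflect", reason: m[1] ?? "" };
    }

    if (!current) continue;
    // Conflicting verdict lines: refuse to guess.
    if (found && found.kind !== current.kind) return { kind: "ambiguous", raw };
    found = found ?? current;
  }
  // No verdict line at all (e.g. a transient error string) is also ambiguous.
  return found ?? { kind: "ambiguous", raw };
}
```

Returning the raw text in the ambiguous case keeps it available to the caller, for example for the WARN-level logging mentioned above.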
Also expected in gatePath:
- `deflection_templates.md`: tier-keyed templates substituted on a `deflect` verdict. Two sections: `## For Interlopers` and `## For Friends`.
- `degraded_templates.md`: static deflection lines used when the Gate API itself is degraded (timeout, empty, or unparseable response, after all retries). Same two-section structure as `deflection_templates.md`.
- `precheck_allowlist.json`: exact-string fast-path bypasses (e.g., trivial acknowledgments). Shape: `{ "allowed": ["ok", "got it", ...] }`.
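The sketch below shows one way these files could be consumed. The function names `parseTemplates` and `isPrechecked` are hypothetical, and the template layout inside each section (one template per non-empty line, with an optional leading `- `) is an assumption, since the text above only specifies the two section headings.

```typescript
type NonPrincipalTier = "friend" | "interloper";

// Split a templates file into per-tier lists using the two section headings.
function parseTemplates(markdown: string): Record<NonPrincipalTier, string[]> {
  const out: Record<NonPrincipalTier, string[]> = { friend: [], interloper: [] };
  let current: NonPrincipalTier | undefined;

  for (const line of markdown.split(/\r?\n/)) {
    const trimmed = line.trim();
    if (trimmed.startsWith("## ")) {
      const heading = trimmed.slice(3).trim().toLowerCase();
      if (heading === "for interlopers") current = "interloper";
      else if (heading === "for friends") current = "friend";
      else current = undefined; // unknown section: ignore its lines
      continue;
    }
    if (current === undefined || trimmed === "") continue;
    // Assumed layout: one template per line, optionally written as a bullet.
    out[current].push(trimmed.replace(/^-\s+/, ""));
  }
  return out;
}

// Exact-string match only: no trimming, no case folding.
function isPrechecked(draft: string, allowlist: { allowed: string[] }): boolean {
  return allowlist.allowed.includes(draft);
}
```

Because the allowlist match is exact, `"ok"` bypasses the Gate while `"OK"` or `"ok."` still go through full evaluation.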
```bash
npm test
```

Runs the vitest suite: 24 tests across safety-backstop, tool-gating, and tier-tagger.
See SECURITY.md. Short version: code-enforced invariants (routing, tagging, encoding, tool-gating) are strong. LLM-judge decisions (semantic injection detection) are bounded by the judge model and have unmeasured recall as of this release — treat the Gate as a soft layer on top of the deterministic structure, not a guarantee.
For the full architectural rationale and threat model, read the companion reference design, in particular THREAT_MODEL.md and 05-security-deepdive.md.
Apache-2.0. See LICENSE.