Commit 01e9533

feat(durable-journal): harden append strictness, invariants, and chaos coverage
1 parent e23ae1e commit 01e9533

17 files changed

Lines changed: 3430 additions & 255 deletions

AGENTS.md

Lines changed: 35 additions & 4 deletions
@@ -737,6 +737,37 @@ function sendWithStdioBackpressure(
 
 ---
 
+## ADR-0019: Durable Command Journal Foundation
+
+**Status:** Accepted (2026-03-02)
+
+Full ADR: `docs/adr/0019-durable-command-journal-foundation.md`
+
+### The Invariant
+
+> Explicit client-provided command IDs must remain deterministic across process restarts.
+
+### Foundation Rules
+
+1. **Append lifecycle durably** - `command_accepted`, `command_started`, `command_finished` are persisted in append-only JSONL when `durableJournal.enabled=true`
+2. **Per-lane monotonic sequence** - each record carries `laneSequence`, resumed from max seen value during startup rehydration
+3. **Recover in-flight explicitly** - explicit IDs left in accepted/started state at crash are deterministically marked failed and written as synthetic recovery `command_finished` records
+4. **Conservative schema policy** - malformed or unsupported schema versions are skipped and counted; no implicit migration in foundation
+5. **Bound observability payloads** - startup recovery and history query responses are bounded with truncation metadata
+6. **Explicit append strictness policy** - `durableJournal.appendFailurePolicy` controls behavior (`best_effort` vs `fail_closed`)
+7. **Redaction hooks at durability seams** - `durableJournal.redaction.beforePersist` and `.beforeExport` support policy-driven data minimization
+
+### Key Properties
+
+1. **Feature-flagged rollout** - defaults off (`durableJournal.enabled=false`) for rollback safety
+2. **Replay continuity** - recovered explicit outcomes are rehydrated into replay store before serving commands
+3. **Failure-path observability** - `get_startup_recovery` and `get_command_history` remain usable for diagnostics when append strict mode latches durable state failed
+4. **Bounded introspection** - `get_command_history` provides filtered journal access (session filter, command ID, time window) with capped response size
+5. **Retention + compaction scaffold** - `durableJournal.retention.{maxEntries,maxAgeMs,maxBytes}` prunes stale terminal outcomes while preserving in-flight recovery semantics
+6. **Chaos-hardened malformed handling** - partial/truncated lines are safely skipped during recovery and compaction without corrupting retained replay semantics
+
+---
+
 ## Pattern: Settled Flag for Promise Races
 
 **Status:** Accepted (2026-02-28)
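
Reviewer note on the hunk above: Foundation Rules 2 to 4 (resume `laneSequence` from the max seen value, fail in-flight commands with synthetic `command_finished` records, skip and count malformed or unsupported lines) might be sketched roughly as follows. This is a minimal illustration, not the code in this commit. The record fields `schemaVersion`, `commandId`, and `lane`, the function names, and the choice to count skipped lines in `entriesScanned` are all assumptions; only `laneSequence` and the three lifecycle event names come from the ADR text.

```typescript
// Illustrative only: field names other than `laneSequence` and the three
// lifecycle event names are assumptions, not the project's actual schema.
type LifecycleEvent = "command_accepted" | "command_started" | "command_finished";

interface JournalRecord {
  schemaVersion: number;
  event: LifecycleEvent;
  commandId: string;
  lane: string;
  laneSequence: number;
}

interface RecoverySummary {
  entriesScanned: number;
  malformedEntries: number;
  unsupportedVersionEntries: number;
  syntheticFailures: JournalRecord[];
  nextLaneSequence: Map<string, number>;
}

const SUPPORTED_SCHEMA_VERSION = 1;
const LIFECYCLE_EVENTS = ["command_accepted", "command_started", "command_finished"];

// Parse one JSONL line; never throws, so a truncated tail cannot abort recovery.
function parseLine(line: string): JournalRecord | "malformed" | "unsupported" {
  let raw: unknown;
  try {
    raw = JSON.parse(line);
  } catch {
    return "malformed";
  }
  if (typeof raw !== "object" || raw === null) return "malformed";
  const { schemaVersion, event, commandId, lane, laneSequence } = raw as Record<string, unknown>;
  if (typeof schemaVersion !== "number") return "malformed";
  if (schemaVersion !== SUPPORTED_SCHEMA_VERSION) return "unsupported";
  if (
    typeof event !== "string" ||
    LIFECYCLE_EVENTS.indexOf(event) === -1 ||
    typeof commandId !== "string" ||
    typeof lane !== "string" ||
    typeof laneSequence !== "number"
  ) {
    return "malformed";
  }
  return { schemaVersion, event: event as LifecycleEvent, commandId, lane, laneSequence };
}

function recoverJournal(jsonl: string): RecoverySummary {
  const summary: RecoverySummary = {
    entriesScanned: 0,
    malformedEntries: 0,
    unsupportedVersionEntries: 0,
    syntheticFailures: [],
    nextLaneSequence: new Map<string, number>(),
  };
  const maxSeq = new Map<string, number>(); // lane -> highest laneSequence seen
  const inFlight = new Map<string, JournalRecord>(); // commandId -> last non-terminal record

  for (const line of jsonl.split("\n")) {
    if (line.trim() === "") continue;
    summary.entriesScanned++;
    const parsed = parseLine(line);
    if (parsed === "malformed") { summary.malformedEntries++; continue; }
    if (parsed === "unsupported") { summary.unsupportedVersionEntries++; continue; }
    maxSeq.set(parsed.lane, Math.max(maxSeq.get(parsed.lane) ?? 0, parsed.laneSequence));
    if (parsed.event === "command_finished") inFlight.delete(parsed.commandId);
    else inFlight.set(parsed.commandId, parsed);
  }

  // Commands still accepted/started at crash time get a synthetic terminal record.
  inFlight.forEach((rec) => {
    const seq = (maxSeq.get(rec.lane) ?? 0) + 1;
    maxSeq.set(rec.lane, seq);
    summary.syntheticFailures.push({ ...rec, event: "command_finished", laneSequence: seq });
  });
  maxSeq.forEach((seq, lane) => summary.nextLaneSequence.set(lane, seq + 1));
  return summary;
}
```

Because the in-flight map preserves insertion order and sequence numbers continue from the per-lane maximum, the same journal bytes always produce the same synthetic records, which is the property the invariant asks for.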
@@ -1138,10 +1169,10 @@ Fuzz test coverage:
 ### Running All Tests
 
 ```bash
-npm test # 83 unit tests
-npm run test:integration # 26 integration tests
-npm run test:fuzz # 17 fuzz tests
-# Module tests (141 total)
+npm test # Main test suite
+npm run test:integration # Integration tests
+npm run test:fuzz # Fuzz tests
+# Module tests
 node --experimental-vm-modules dist/test-command-classification.js
 node --experimental-vm-modules dist/test-session-version-store.js
 node --experimental-vm-modules dist/test-command-replay-store.js

PROTOCOL.md

Lines changed: 66 additions & 9 deletions
@@ -156,13 +156,14 @@ Optional fields:
 Server emits these event families:
 
 1. `server_ready`
-2. `server_shutdown`
-3. `session_created`
-4. `session_deleted`
-5. `event` (session-scoped AgentSession event)
-6. `command_accepted`
-7. `command_started`
-8. `command_finished`
+2. `startup_recovery_summary` (convenience; endpoint-first flow remains canonical)
+3. `server_shutdown`
+4. `session_created`
+5. `session_deleted`
+6. `event` (session-scoped AgentSession event)
+7. `command_accepted`
+8. `command_started`
+9. `command_finished`
 
 ### 6.1 Session-scoped AgentSession events
 
@@ -337,7 +338,7 @@ Conformant clients:
 
 1. No global total ordering (lane determinism only).
 2. Timeout does not prove cancellation completed.
-3. Durable replay beyond process lifetime is partially implemented behind a feature flag (`durableJournal.enabled`); full history/replay APIs remain Level 4 work.
+3. Durable replay beyond process lifetime is feature-flagged (`durableJournal.enabled`). `get_command_history` is available for bounded journal introspection; deterministic replay/export tooling remains Level 4 work.
 
 ---
 
@@ -360,7 +361,39 @@ Emitted on connection before any other messages. Announces server capabilities.
 
 Clients SHOULD check `protocolVersion` for compatibility.
 
-### 16.2 `server_shutdown`
+### 16.2 `startup_recovery_summary`
+
+Optional convenience event carrying the same payload schema as `get_startup_recovery`.
+
+By default, servers MAY redact sensitive fields in this event (for example: `journalPath`, `initializationError`, `recoveredOutcomeIds`, `recoveredInFlight`). Full diagnostics remain available via explicit `get_startup_recovery` requests.
+
+```json
+{
+  "type": "startup_recovery_summary",
+  "data": {
+    "enabled": true,
+    "initialized": true,
+    "initState": "ready",
+    "initializationError": "Initialization failed (details redacted; call get_startup_recovery for full diagnostics)",
+    "journalPath": "[redacted]",
+    "schemaVersion": 1,
+    "entriesScanned": 0,
+    "malformedEntries": 0,
+    "unsupportedVersionEntries": 0,
+    "recoveredOutcomes": 0,
+    "recoveredOutcomeIds": [],
+    "recoveredOutcomeIdsTruncated": false,
+    "recoveredInFlightFailures": 0,
+    "recoveredInFlight": [],
+    "recoveredInFlightTruncated": false,
+    "maxItemsReturned": 100
+  }
+}
+```
+
+This event is advisory. Clients SHOULD continue using `get_startup_recovery` as the canonical endpoint. Deployments that need full event detail can opt in via server configuration (`PiServerOptions.startupRecoverySummaryEvent.includeSensitiveData = true`).
+
+### 16.3 `server_shutdown`
 
 Emitted before server closes. Clients should expect connection termination.
 
@@ -388,11 +421,35 @@ Emitted before server closes. Clients should expect connection termination.
 | `switch_session` | Subscribe to session | `{ sessionInfo }` |
 | `get_metrics` | Server metrics | See `get_metrics` response |
 | `health_check` | Health status | `{ healthy, issues, hasOpenCircuit, hasOpenBashCircuit }` |
+| `get_startup_recovery` | Startup durable-journal recovery summary | `{ enabled, initialized, initState, initializationError?, journalPath, schemaVersion, entriesScanned, malformedEntries, unsupportedVersionEntries, recoveredOutcomes, recoveredOutcomeIds, recoveredOutcomeIdsTruncated, recoveredInFlightFailures, recoveredInFlight, recoveredInFlightTruncated, maxItemsReturned }` |
+| `get_command_history` | Bounded durable journal history query (ADR-0019) | `{ enabled, initialized, initState, initializationError?, journalPath, schemaVersion, filters, entries, returned, truncated, maxItemsReturned, maxItemsAllowed }` |
 | `list_stored_sessions` | List persisted sessions (ADR-0007) | `{ sessions: StoredSessionInfo[] }` |
 | `load_session` | Load session from disk (ADR-0007) | `{ sessionId, sessionInfo }` |
 
 > **Note:** `delete_session` unloads the session from server memory but does NOT delete the session file from disk. The session can be reloaded later via `load_session` or discovered via `list_stored_sessions`.
 
+#### `get_command_history` request fields
+
+```json
+{
+  "type": "get_command_history",
+  "sessionIdFilter": "session-123",
+  "commandId": "cmd-42",
+  "fromTimestamp": 1772412000000,
+  "toTimestamp": 1772415600000,
+  "limit": 100
+}
+```
+
+- `sessionIdFilter` (optional): exact match on journal entry `sessionId`
+- `commandId` (optional): exact match on journal entry command ID
+- `fromTimestamp` / `toTimestamp` (optional): inclusive time bounds on `recordedAt`
+- `limit` (optional): max entries returned (default `100`, hard max `500`)
+
+Entries are returned in append order and may be truncated when `limit` is reached.
+Servers MAY also apply internal scan guardrails (line/time budget) and set `truncated: true`
+when the query exceeds those bounds.
+
 ### 17.2 Session commands (require `sessionId`)
 
 **Discovery:**
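
Reviewer note on the `get_command_history` hunk above: the documented filter semantics (exact-match session and command filters, inclusive `recordedAt` bounds, default limit `100`, hard max `500`, append-order results with a `truncated` flag) could be exercised with a small in-memory sketch like the one below. It is a hedged illustration only. The `HistoryEntry` shape, the `queryHistory` name, the clamping of non-positive or non-finite limits, and the reading of `maxItemsReturned` as the effective limit and `maxItemsAllowed` as the hard cap are assumptions; the server's scan guardrails are not modelled.

```typescript
// Hypothetical entry shape: only sessionId, the command ID, and recordedAt
// are named in the protocol text.
interface HistoryEntry {
  sessionId: string;
  commandId: string;
  recordedAt: number;
}

interface HistoryQuery {
  sessionIdFilter?: string;
  commandId?: string;
  fromTimestamp?: number;
  toTimestamp?: number;
  limit?: number;
}

interface HistoryResult {
  entries: HistoryEntry[];
  returned: number;
  truncated: boolean;
  maxItemsReturned: number; // assumed meaning: effective limit for this query
  maxItemsAllowed: number; // assumed meaning: hard cap
}

const DEFAULT_LIMIT = 100;
const HARD_MAX_LIMIT = 500;

function effectiveLimit(requested: number | undefined): number {
  if (requested === undefined || !isFinite(requested)) return DEFAULT_LIMIT;
  // Assumption: limits below 1 are clamped up rather than rejected.
  return Math.min(Math.max(1, Math.floor(requested)), HARD_MAX_LIMIT);
}

function queryHistory(journal: readonly HistoryEntry[], q: HistoryQuery): HistoryResult {
  const limit = effectiveLimit(q.limit);
  const entries: HistoryEntry[] = [];
  let truncated = false;

  // The journal is assumed to already be in append order.
  for (const e of journal) {
    if (q.sessionIdFilter !== undefined && e.sessionId !== q.sessionIdFilter) continue;
    if (q.commandId !== undefined && e.commandId !== q.commandId) continue;
    if (q.fromTimestamp !== undefined && e.recordedAt < q.fromTimestamp) continue; // inclusive lower bound
    if (q.toTimestamp !== undefined && e.recordedAt > q.toTimestamp) continue; // inclusive upper bound
    if (entries.length === limit) {
      truncated = true; // at least one more match exists beyond the limit
      break;
    }
    entries.push(e);
  }

  return {
    entries,
    returned: entries.length,
    truncated,
    maxItemsReturned: limit,
    maxItemsAllowed: HARD_MAX_LIMIT,
  };
}
```

Setting `truncated` only when a further matching entry exists lets a client tell "exactly `limit` matches" apart from "more remain", at the cost of scanning one extra match.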

README.md

Lines changed: 4 additions & 4 deletions
@@ -142,11 +142,11 @@ npm install
 npm run build
 
 # Run tests
-npm test # Unit tests (83)
-npm run test:integration # Integration tests (26)
-npm run test:fuzz # Fuzz tests (17)
+npm test # Main test suite
+npm run test:integration # Integration tests
+npm run test:fuzz # Fuzz tests
 
-# Module tests (141)
+# Module tests
 node --experimental-vm-modules dist/test-command-classification.js
 node --experimental-vm-modules dist/test-session-version-store.js
 node --experimental-vm-modules dist/test-command-replay-store.js

ROADMAP.md

Lines changed: 5 additions & 5 deletions
@@ -100,7 +100,7 @@ Each unchecked item requires an **owner**, an **acceptance test**, and a **decis
 ### L4.2 Crash recovery model
 - [ ] Rehydrate journal on startup
 - [ ] Classify pre-crash in-flight commands (`recoverable` / `failed`) with explicit reason
-- [ ] Expose recovery summary event/endpoint
+- [x] Expose recovery summary event/endpoint
 
 **Owner:** TBD
 
@@ -113,7 +113,7 @@ Each unchecked item requires an **owner**, an **acceptance test**, and a **decis
 
 ### L4.3 Replay and trace extraction
 - [ ] Deterministic replay mode for audit/debug
-- [ ] `get_command_history` API (session, commandId, time-window filters)
+- [x] `get_command_history` API (session, commandId, time-window filters; bounded response)
 - [ ] Redaction-aware export path for incident reports
 
 **Owner:** TBD
@@ -126,9 +126,9 @@ Each unchecked item requires an **owner**, an **acceptance test**, and a **decis
 - Replay placement: in-process feature vs offline tool
 
 ### L4.4 Retention, compaction, privacy controls
-- [ ] Retention policy (time + size)
-- [ ] Compaction that preserves replay correctness
-- [ ] PII redaction hooks before persistence/export
+- [x] Retention policy (time + size)
+- [x] Compaction that preserves replay correctness for retained outcomes + in-flight recovery
+- [x] PII redaction hooks before persistence/export
 
 **Owner:** TBD
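
Reviewer note on L4.4: the retention item reads as "prune terminal outcomes by age and count, never drop in-flight commands". A rough sketch of that idea follows, under stated assumptions: the record shape and the `compact` function are hypothetical, pruning is done per command so that a lifecycle is never left half-retained, and `maxBytes` is omitted for brevity. It is not the compaction code from this commit.

```typescript
// Hypothetical record shape for illustration.
interface RetainedRecord {
  commandId: string;
  event: "command_accepted" | "command_started" | "command_finished";
  recordedAt: number;
}

// Subset of the documented retention options; maxBytes is left out of this sketch.
interface RetentionPolicy {
  maxEntries?: number;
  maxAgeMs?: number;
}

function compact(
  records: readonly RetainedRecord[],
  policy: RetentionPolicy,
  now: number,
): RetainedRecord[] {
  // Commands that reached a terminal record, keyed to their finish time.
  // Anything absent from this map is in flight and is always retained.
  const finishedAt = new Map<string, number>();
  for (const r of records) {
    if (r.event === "command_finished") finishedAt.set(r.commandId, r.recordedAt);
  }

  // Age pruning: keep terminal commands no older than maxAgeMs, oldest first.
  const maxAge = policy.maxAgeMs;
  const survivors: string[] = [];
  finishedAt.forEach((finishedTime, id) => {
    if (maxAge === undefined || now - finishedTime <= maxAge) survivors.push(id);
  });

  const keep = new Set<string>(survivors);
  const isRetained = (r: RetainedRecord): boolean =>
    !finishedAt.has(r.commandId) || keep.has(r.commandId);

  // Count pruning: drop the oldest terminal commands until the journal fits.
  // Quadratic, which is acceptable for a sketch but not for a large journal.
  const maxEntries = policy.maxEntries;
  if (maxEntries !== undefined) {
    for (const id of survivors) {
      if (records.filter(isRetained).length <= maxEntries) break;
      keep.delete(id);
    }
  }

  return records.filter(isRetained);
}
```

Note that in-flight records can keep the result above `maxEntries`; the sketch treats preserving recovery semantics as taking priority over the size bound, which matches the wording of the checklist item but is still an interpretation.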

docs/adr/0019-durable-command-journal-foundation.md

Lines changed: 10 additions & 2 deletions
@@ -89,10 +89,18 @@ This preserves rollback safety while validating operational behavior.
 
 ## Follow-up work
 
-- retention + compaction with replay-equivalence guarantees
+- ✅ recovery summary endpoint (`get_startup_recovery`) in protocol surface
+- ✅ recovery summary startup event (`startup_recovery_summary`) in protocol surface (convenience; endpoint remains canonical)
+- ✅ bounded history query endpoint (`get_command_history`) with session/command/time filters
+- ✅ retention + compaction foundation (`durableJournal.retention` with maxEntries/maxAgeMs/maxBytes), preserving retained replay + in-flight recovery semantics
+- ✅ single-writer lock file enforcement to prevent multi-process compaction/append races on one journal path
+- ✅ bounded history-query scan guardrails (line/time budget) to avoid unbounded server-lane scans
+- ✅ append write-failure strictness policy (`durableJournal.appendFailurePolicy`: `best_effort` / `fail_closed`)
+- ✅ redaction hooks for persistence/export surfaces (`durableJournal.redaction.beforePersist` / `beforeExport`)
+- ✅ chaos coverage for malformed/partial journal lines around recovery + compaction
 - optional SQLite backend evaluation (decision gate revisit)
-- recovery summary endpoint/event in protocol surface
 - schema migration tooling and fixtures
+- deterministic replay/export tooling for incident workflows
 
 ## References
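
Reviewer note on the append strictness follow-up: the difference between `best_effort` and `fail_closed` might look roughly like the sketch below. The `JournalAppender` class, its injected `write` callback, and the exact latch behaviour (every later append throws once a `fail_closed` write has failed) are assumptions made for illustration; the commit text only states that strict mode "latches durable state failed" while diagnostics stay available.

```typescript
type AppendFailurePolicy = "best_effort" | "fail_closed";
type LineWriter = (line: string) => void;

// Hypothetical wrapper around whatever actually writes JSONL lines to disk.
class JournalAppender {
  private readonly write: LineWriter;
  private readonly policy: AppendFailurePolicy;
  private latchedError: string | undefined = undefined;
  private dropped = 0;

  constructor(write: LineWriter, policy: AppendFailurePolicy) {
    this.write = write;
    this.policy = policy;
  }

  // Returns true when the record was written, false when a best_effort
  // failure was swallowed. Throws under fail_closed, now and on later calls.
  append(record: object): boolean {
    if (this.latchedError !== undefined) {
      throw new Error("durable journal unavailable: " + this.latchedError);
    }
    try {
      this.write(JSON.stringify(record) + "\n");
      return true;
    } catch (err) {
      const message = err instanceof Error ? err.message : String(err);
      if (this.policy === "fail_closed") {
        this.latchedError = message; // latch: refuse further appends
        throw new Error("durable journal append failed: " + message);
      }
      this.dropped++; // best_effort: count the loss and keep serving
      return false;
    }
  }

  get isLatchedFailed(): boolean {
    return this.latchedError !== undefined;
  }

  get droppedAppends(): number {
    return this.dropped;
  }
}
```

Under this reading, `best_effort` trades durability for availability, while `fail_closed` stops accepting journaled work as soon as durability can no longer be guaranteed; read-only diagnostics such as `get_startup_recovery` would sit outside the appender and stay reachable.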

0 commit comments