v0.4.0: Fix correctness bugs, add lossless reconnect, outbound track…#3

Merged
ajit-zer07 merged 1 commit into main from fix-correctness-bugs on Mar 21, 2026

Summary

  • Fix premature run finalization: Commitment messages no longer synthesize SESSION_STATE_RESOLVED — runs only complete
    when the runtime reports resolution via session-snapshot or GetSession reconciliation
  • Lossless reconnect: Stream cursor persisted after each event; recovery resumes from max(persisted_cursor, last_event_seq)
  • Outbound message tracking: New run_outbound_messages table tracks all kickoff/signal/context messages with lifecycle
    status
  • Operational resilience: Durable webhook outbox, distributed recovery locking via advisory locks, batch recovery reporting,
    6 new Prometheus counters
  • New endpoints: POST /runs/validate (preflight), POST /runs/:id/context, POST /runs/:id/projection/rebuild, GET /runs/:id/messages, GET /runs/:id/export/stream (JSONL)
  • Horizontal scaling: Redis pub/sub StreamHubStrategy for multi-instance SSE fan-out

Changes by Phase

Phase 0: Critical Correctness Fixes

| Fix | File(s) |
| --- | --- |
| clientVersion from package.json | app-config.service.ts, run-executor.service.ts |
| Remove premature Commitment → RESOLVED | event-normalizer.service.ts |
| Remove decision.finalized auto-finalize | stream-consumer.service.ts |
| Atomic finalization via finalizingPromise | stream-consumer.service.ts |
| findByIdOrThrow → NotFoundException | run.repository.ts |
| console.error → NestJS Logger | main.ts |
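
To illustrate the finalizingPromise idea, here is a minimal sketch of how a shared in-flight promise can collapse concurrent finalization attempts into one write. Only the name `finalizingPromise` comes from this PR; `RunFinalizer`, `FinalStatus`, and the `persist` callback are hypothetical, and the actual code in stream-consumer.service.ts may be structured differently.

```typescript
// Hypothetical sketch: only the field name `finalizingPromise` is taken from the PR.
type FinalStatus = 'completed' | 'failed';

class RunFinalizer {
  // Holds the in-flight finalization, if one has started.
  private finalizingPromise: Promise<FinalStatus> | null = null;

  // `persist` stands in for whatever writes the terminal state.
  constructor(private readonly persist: (status: FinalStatus) => Promise<void>) {}

  finalize(status: FinalStatus): Promise<FinalStatus> {
    if (!this.finalizingPromise) {
      // First caller starts the write; later callers reuse the same promise.
      this.finalizingPromise = this.persist(status)
        .then(() => status)
        .catch((err) => {
          // Clear the slot on failure so a later attempt can retry.
          this.finalizingPromise = null;
          throw err;
        });
    }
    return this.finalizingPromise;
  }
}
```

In this sketch the first caller's status wins: a second concurrent `finalize('failed')` resolves to whatever the first call persisted.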

Phase 1: Runtime Integration Fidelity

  • POST /runs/:id/context — send context updates to running sessions
  • POST /runs/:id/projection/rebuild — rebuild projection from persisted events
  • Persist runtime capabilities from Initialize response (4 new columns on runtime_sessions)
  • schema_version column on run_events_canonical
  • Fix webhooks.active from integer to boolean

Phase 2: Lossless Reconnect & Outbound Tracking

  • Stream cursor persistence for crash recovery
  • run_outbound_messages table + repository + GET /runs/:id/messages
  • OutboundMessageSummary in RunStateProjection
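
The resume rule stated in the Summary, max(persisted_cursor, last_event_seq), can be sketched as a small pure function. The function name and the treatment of missing values as 0 (meaning "start from the beginning") are assumptions made for illustration, not details confirmed by this PR.

```typescript
// Hypothetical helper; the PR states only the max(persisted_cursor, last_event_seq) rule.
// Assumption: sequence numbers are non-negative, and 0 means "no events seen yet".
function resumeCursor(
  persistedCursor: number | null,
  lastEventSeq: number | null,
): number {
  return Math.max(persistedCursor ?? 0, lastEventSeq ?? 0);
}
```

Taking the maximum means recovery never rewinds behind an event that was already stored, even if the cursor write lagged behind the event write or vice versa.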

Phase 3: Signal & Progress Enrichment

  • signalType, severity fields on SendSignalDto
  • TaskUpdate/TaskComplete/TaskFail → additional progress.reported events
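
As a rough sketch of the task-message mapping, the three message kinds could be normalized into progress.reported events along these lines. The message and event field shapes (`taskId`, `percent`, `reason`, `status`, `detail`) are hypothetical; the PR names only the message kinds and the event type.

```typescript
// Hypothetical shapes: the PR names only TaskUpdate/TaskComplete/TaskFail and progress.reported.
type TaskMessage =
  | { kind: 'TaskUpdate'; taskId: string; percent: number }
  | { kind: 'TaskComplete'; taskId: string }
  | { kind: 'TaskFail'; taskId: string; reason: string };

interface ProgressReported {
  type: 'progress.reported';
  taskId: string;
  status: 'running' | 'completed' | 'failed';
  detail?: string;
}

function toProgressEvent(msg: TaskMessage): ProgressReported {
  switch (msg.kind) {
    case 'TaskUpdate':
      return { type: 'progress.reported', taskId: msg.taskId, status: 'running', detail: `${msg.percent}%` };
    case 'TaskComplete':
      return { type: 'progress.reported', taskId: msg.taskId, status: 'completed' };
    case 'TaskFail':
      return { type: 'progress.reported', taskId: msg.taskId, status: 'failed', detail: msg.reason };
  }
}
```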

Phase 4: Operational Resilience

  • Recovery batch result reporting { recovered, failed }
  • Durable webhook outbox with webhook_deliveries table
  • Advisory lock–based distributed recovery (pg_try_advisory_lock)
  • 6 new Prometheus counters (outbound, inbound, signals, reconnects, recovery, webhooks)
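
The advisory-lock approach can be sketched as a wrapper that runs recovery only on the instance that wins the lock. `pg_try_advisory_lock` and `pg_advisory_unlock` are real PostgreSQL functions; the `SqlClient` interface, the `withRecoveryLock` name, and the lock key are hypothetical stand-ins. Session-level advisory locks belong to a connection, so this assumes `db` is one dedicated connection rather than a pool.

```typescript
// Minimal query interface so the sketch does not depend on a specific driver (hypothetical).
interface SqlClient {
  query(sql: string, params: unknown[]): Promise<{ rows: Array<Record<string, unknown>> }>;
}

// Runs `work` only if this instance obtains the lock; returns null when another instance holds it.
async function withRecoveryLock<T>(
  db: SqlClient,
  lockKey: number,
  work: () => Promise<T>,
): Promise<T | null> {
  const res = await db.query('SELECT pg_try_advisory_lock($1) AS locked', [lockKey]);
  if (res.rows[0]?.locked !== true) {
    return null; // Lock is held elsewhere; skip this recovery pass.
  }
  try {
    return await work();
  } finally {
    // Release on the same connection that acquired the lock.
    await db.query('SELECT pg_advisory_unlock($1)', [lockKey]);
  }
}
```

A caller might pass a `work` function that returns the batch summary `{ recovered, failed }`, so a `null` result distinguishes "skipped because another instance is recovering" from "ran and recovered nothing".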

Phase 5: New Features & Scalability

  • POST /runs/validate — preflight validation without creating a run
  • RedisStreamHubStrategy for horizontal SSE scaling
  • GET /runs/:id/export/stream — streaming JSONL export
  • src/runtime/grpc-types.ts — typed gRPC interfaces
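
For the streaming JSONL export, a minimal sketch of the async-generator pattern follows. The `toJsonl` name and the event source are hypothetical; the PR confirms only that the export is streamed as JSONL via an async generator.

```typescript
// Hypothetical helper: serializes each event as one JSON document per line (JSONL).
async function* toJsonl<T>(
  events: AsyncIterable<T> | Iterable<T>,
): AsyncGenerator<string> {
  for await (const event of events) {
    // One line per event, so consumers can parse incrementally.
    yield JSON.stringify(event) + '\n';
  }
}
```

Because lines are produced lazily, a handler can write each one to the response as it arrives instead of buffering the whole run in memory.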

Migrations

| File | Purpose |
| --- | --- |
| 0006_capabilities_and_stream.sql | capabilities + stream cursor on runtime_sessions |
| 0007_canonical_schema_version.sql | schema_version on run_events_canonical |
| 0008_webhook_active_boolean.sql | integer → boolean for webhooks.active |
| 0009_outbound_messages.sql | run_outbound_messages table |
| 0010_webhook_deliveries.sql | webhook_deliveries table |

New Files

| File | Purpose |
| --- | --- |
| src/storage/outbound-message.repository.ts | CRUD for run_outbound_messages |
| src/webhooks/webhook-delivery.repository.ts | CRUD for webhook_deliveries |
| src/events/redis-stream-hub.strategy.ts | Redis pub/sub StreamHubStrategy |
| src/runtime/grpc-types.ts | Typed interfaces for gRPC shapes |

Test plan

  • 253 tests passing (up from 248), 20 suites
  • TypeScript compiles cleanly (tsc --noEmit)
  • 0 lint errors
  • Run npm run drizzle:migrate against a live database
  • POST /runs with decision-mode request → verify run completes only via runtime session-snapshot
  • Kill and restart control plane during active run → verify recovery resumes from persisted cursor
  • POST /runs/validate with unsupported mode → verify error response
  • POST /runs/:id/context during running session → verify context update flows through

…ing, and operational resilience

  Fix premature run finalization where Commitment messages synthesized
  SESSION_STATE_RESOLVED without runtime confirmation, and decision.finalized
  events auto-completed runs. Runs now only finalize via runtime authority
  (session-snapshot or GetSession reconciliation).

  Phase 0 — Critical fixes:
  - Read clientVersion from package.json instead of hardcoded '0.2.0'
  - Remove synthetic session.state.changed from Commitment normalization
  - Remove decision.finalized auto-finalize in stream consumer
  - Add finalizingPromise to prevent race conditions in concurrent finalization
  - findByIdOrThrow now throws NotFoundException instead of plain Error
  - Replace console.error with NestJS Logger in bootstrap

  Phase 1 — Runtime integration:
  - Implement POST /runs/:id/context endpoint
  - Implement POST /runs/:id/projection/rebuild endpoint
  - Persist runtime capabilities from Initialize response
  - Add schema_version column to canonical events
  - Fix webhook active column from integer to boolean

  Phase 2 — Lossless reconnect & outbound tracking:
  - Persist stream cursor after each event for lossless reconnect
  - Recovery reads persisted cursor for accurate resume position
  - Add run_outbound_messages table and repository
  - Add GET /runs/:id/messages endpoint
  - Add outboundMessages summary to RunStateProjection

  Phase 3 — Signal & progress enrichment:
  - Add signalType and severity fields to SendSignalDto
  - Emit progress.reported for TaskUpdate, TaskComplete, TaskFail

  Phase 4 — Operational resilience:
  - Recovery returns batch result summary {recovered, failed}
  - Durable webhook outbox with delivery tracking table
  - Distributed recovery locking via PostgreSQL advisory locks
  - Add 6 new Prometheus counters for observability

  Phase 5 — New features & scalability:
  - POST /runs/validate preflight endpoint
  - Redis StreamHub strategy for horizontal scaling
  - Streaming JSONL export via async generator
  - Typed gRPC interfaces to reduce any casts
@ajit-zer07 ajit-zer07 merged commit e3d9f5d into main Mar 21, 2026
5 checks passed
@ajit-zer07 ajit-zer07 deleted the fix-correctness-bugs branch March 21, 2026 02:53
@ajit-zer07 ajit-zer07 restored the fix-correctness-bugs branch March 22, 2026 18:43
@ajit-zer07 ajit-zer07 deleted the fix-correctness-bugs branch March 22, 2026 18:44