fix(webhooks): durable delivery queue + re-drive (audit #3)#98

Merged: brownjuly2003-code merged 1 commit into main from fix/webhook-durable-redrive on Jun 28, 2026
@brownjuly2003-code

Owner

Audit #3 — webhook durable re-drive

Problem

The dispatcher marked an event as seen (in memory) before attempting delivery, and delivery happened inline with a bounded retry burst. A delivery that failed every attempt, or any event in flight when the process restarted, was silently dropped (audit_28_06_26.md #3): on restart the in-memory seen-set is rebuilt from the events that already exist, so those events are never re-attempted. This violates at-least-once delivery.
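The failure mode can be reproduced with a small model. This is a hedged sketch, not the project's dispatcher: `InlineOnlyDispatcher`, `always_down`, and `always_up` are hypothetical names, and only the mark-seen-before-delivery ordering, the bounded burst, and the seen-set rebuild on restart are taken from the description above.

```python
from typing import Callable


class InlineOnlyDispatcher:
    """Hypothetical model of the pre-fix behaviour (illustration only)."""

    def __init__(self, existing_event_ids: list[str], send: Callable[[str], bool]) -> None:
        # On startup the seen-set is rebuilt from events that already exist,
        # so anything left undelivered before the restart looks handled.
        self.seen: set[str] = set(existing_event_ids)
        self.send = send

    def dispatch_new_events(self, event_ids: list[str], burst: int = 3) -> list[str]:
        delivered: list[str] = []
        for event_id in event_ids:
            if event_id in self.seen:
                continue
            self.seen.add(event_id)  # marked seen BEFORE delivery is attempted
            # Bounded inline burst: stop at the first successful attempt.
            if any(self.send(event_id) for _ in range(burst)):
                delivered.append(event_id)
            # A delivery that fails the whole burst is recorded nowhere.
        return delivered


def always_down(event_id: str) -> bool:
    return False


def always_up(event_id: str) -> bool:
    return True


# The receiver is down for the whole burst: evt-1 is lost.
dispatcher = InlineOnlyDispatcher([], always_down)
lost = dispatcher.dispatch_new_events(["evt-1"])

# The receiver recovers, but evt-1 is already in the seen-set.
dispatcher.send = always_up
second_pass = dispatcher.dispatch_new_events(["evt-1"])

# A restart rebuilds the seen-set from existing events, so still no retry.
restarted = InlineOnlyDispatcher(["evt-1"], always_up)
after_restart = restarted.dispatch_new_events(["evt-1"])
```

In all three passes the list of delivered events is empty, even though the receiver is healthy for the last two.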

Fix

Add a durable per-(webhook, event) delivery queue in DuckDB (webhook_delivery_queue, PK (webhook_id, event_id)) alongside the existing append-only webhook_deliveries log:

  • dispatch_new_events enqueues each matching delivery durably, attempts it inline (unchanged happy-path latency), then records the outcome.
  • process_delivery_queue re-drives due 'pending' rows on each loop pass with a backoff schedule, parks a row as 'dead' after max_delivery_attempts, and parks (rather than retrying forever) a delivery whose webhook was removed or deactivated. The stored canonical body lets a delivery be replayed without re-reading pipeline_events, so it survives a restart.
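The queue mechanics described in the list above can be sketched as follows. This is a minimal illustration under stated assumptions, not the merged code: it uses the standard-library `sqlite3` module as a stand-in for DuckDB so that it runs without extra dependencies, and the column names other than the primary key, the `'delivered'` status, the `BACKOFF_SECONDS` schedule, the `MAX_DELIVERY_ATTEMPTS` value, and the helper signatures are all hypothetical. Only the table name, the `(webhook_id, event_id)` key, the `'pending'` and `'dead'` states, and the park-on-removed-webhook rule come from the PR text.

```python
import sqlite3
from typing import Callable

# Hypothetical values; the PR does not state the schedule or the attempt cap.
BACKOFF_SECONDS = (30.0, 120.0, 600.0)
MAX_DELIVERY_ATTEMPTS = 4
KEY = "webhook_id = ? AND event_id = ?"


def open_queue() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")  # sqlite3 stands in for DuckDB here
    conn.execute(
        """
        CREATE TABLE webhook_delivery_queue (
            webhook_id      TEXT NOT NULL,
            event_id        TEXT NOT NULL,
            body            TEXT NOT NULL,   -- stored canonical body
            status          TEXT NOT NULL DEFAULT 'pending',
            attempts        INTEGER NOT NULL DEFAULT 0,
            next_attempt_at REAL NOT NULL,
            PRIMARY KEY (webhook_id, event_id)
        )
        """
    )
    return conn


def enqueue(conn: sqlite3.Connection, webhook_id: str, event_id: str,
            body: str, now: float) -> bool:
    """Idempotent enqueue: a second call for the same key is a no-op."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO webhook_delivery_queue "
        "(webhook_id, event_id, body, next_attempt_at) VALUES (?, ?, ?, ?)",
        (webhook_id, event_id, body, now),
    )
    conn.commit()
    return cur.rowcount == 1


def record_outcome(conn: sqlite3.Connection, webhook_id: str, event_id: str,
                   ok: bool, now: float) -> str:
    """Count one attempt and move the row to its next state."""
    row = conn.execute(
        f"SELECT attempts FROM webhook_delivery_queue WHERE {KEY}",
        (webhook_id, event_id),
    ).fetchone()
    attempts = row[0] + 1
    if ok:
        status, next_at = "delivered", now
    elif attempts >= MAX_DELIVERY_ATTEMPTS:
        status, next_at = "dead", now  # parked, never re-driven
    else:
        delay = BACKOFF_SECONDS[min(attempts - 1, len(BACKOFF_SECONDS) - 1)]
        status, next_at = "pending", now + delay
    conn.execute(
        "UPDATE webhook_delivery_queue "
        f"SET status = ?, attempts = ?, next_attempt_at = ? WHERE {KEY}",
        (status, attempts, next_at, webhook_id, event_id),
    )
    conn.commit()
    return status


def process_delivery_queue(conn: sqlite3.Connection,
                           send: Callable[[str, str], bool],
                           active_webhooks: set[str], now: float) -> int:
    """Re-drive every due 'pending' row once; return how many were handled."""
    due = conn.execute(
        "SELECT webhook_id, event_id, body FROM webhook_delivery_queue "
        "WHERE status = 'pending' AND next_attempt_at <= ? "
        "ORDER BY next_attempt_at",
        (now,),
    ).fetchall()
    for webhook_id, event_id, body in due:
        if webhook_id not in active_webhooks:
            # Webhook removed or deactivated: park instead of retrying forever.
            conn.execute(
                f"UPDATE webhook_delivery_queue SET status = 'dead' WHERE {KEY}",
                (webhook_id, event_id),
            )
            conn.commit()
            continue
        record_outcome(conn, webhook_id, event_id, send(webhook_id, body), now)
    return len(due)


conn = open_queue()
first_insert = enqueue(conn, "wh-1", "evt-1", '{"id": "evt-1"}', now=0.0)
duplicate = enqueue(conn, "wh-1", "evt-1", '{"id": "evt-1"}', now=0.0)
after_failure = record_outcome(conn, "wh-1", "evt-1", ok=False, now=0.0)
too_early = process_delivery_queue(conn, lambda w, b: True, {"wh-1"}, now=10.0)
redriven = process_delivery_queue(conn, lambda w, b: True, {"wh-1"}, now=30.0)
```

Because the body is stored with the row, the re-drive pass needs nothing but the queue table, which is what lets a fresh process pick up where the previous one stopped. A file-backed database would be needed to show that across a real restart; the in-memory connection here only demonstrates the state transitions.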

mark-seen-on-scan is left untouched: it still drives event-driven metric cache invalidation (main.py wraps dispatch on seen-set growth). deliver() keeps its return shape and 3-attempt burst for the /test endpoint — the durable queue is layered over inline delivery, not a replacement.

Verification (no-Docker, on Windows)

  • ruff / ruff format / mypy clean
  • full unit suite 1135 passed / 1 skipped
  • webhook unit + integration 34 passed, including new tests for enqueue idempotency, the outcome state machine, re-drive of due / dead-at-max / not-due / removed-webhook rows, survival across a fresh dispatcher instance (restart), and an end-to-end failed-then-redriven delivery.

🤖 Generated with Claude Code

The dispatcher marked an event seen (in-memory) before attempting
delivery, and delivery happened inline with a bounded retry burst. A
delivery that failed every attempt -- or any event in flight when the
process restarted -- was silently dropped (audit_28_06_26.md #3): the
in-memory seen-set is rebuilt from existing events on restart and never
re-attempts.

Add a durable per-(webhook, event) delivery queue in DuckDB
(webhook_delivery_queue, PK (webhook_id, event_id)) alongside the
existing append-only webhook_deliveries log:
- dispatch_new_events enqueues each matching delivery durably, attempts
  it inline (unchanged happy-path latency), then records the outcome.
- process_delivery_queue re-drives due 'pending' rows each loop pass with
  a backoff schedule, parks a row 'dead' after max_delivery_attempts, and
  parks (does not retry forever) a delivery whose webhook was removed or
  deactivated. The stored canonical body lets a delivery be replayed
  without re-reading pipeline_events, so it survives a restart.

mark-seen-on-scan is left untouched: it still drives event-driven metric
cache invalidation (main.py wraps dispatch on seen-set growth). deliver()
keeps its return shape and 3-attempt burst for the /test endpoint.

Verification (no-Docker, on Windows): ruff / ruff format / mypy clean;
full unit suite 1135 passed / 1 skipped; webhook unit+integration 34
passed including new tests for enqueue idempotency, the outcome state
machine, re-drive of due / dead-at-max / not-due / removed-webhook rows,
survival across a fresh dispatcher instance (restart), and an end-to-end
failed-then-redriven delivery.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@github-actions

DORA Metrics

  • Window: last 30 days
  • Branch: main
  • Deployment frequency: 187 total / 43.63 per week
  • Lead time for changes: avg 0.24h / median 0.0h
  • Change failure rate: 63.64% (119/187)
  • MTTR: 0.23h across 4 incident(s)

@brownjuly2003-code brownjuly2003-code merged commit 86a1c89 into main Jun 28, 2026
23 of 24 checks passed
@brownjuly2003-code brownjuly2003-code deleted the fix/webhook-durable-redrive branch June 28, 2026 22:22