feat: thread continuity, graceful shutdown, event dedup, and retry hardening#8

Open
braghettos wants to merge 1 commit into kagent-dev:main from braghettos:fix/thread-continuity-graceful-shutdown

Conversation

@braghettos

Summary

This PR production-hardens the Slack bot based on real-world debugging of a Krateo observability pipeline (HyperDX alerts → Slack → KAgent autopilot → ClickHouse investigation → response).

Every fix addresses an issue encountered in production:

Thread → A2A Task Continuity

  • Follow-up @bot replies in the same Slack thread now continue the same A2A task, giving the agent full conversation history
  • In-memory thread_key → task_id mapping with 24h TTL
  • Graceful fallback: when a completed task can't accept new messages (kagent returns 500 "terminal state"), the bot creates a new task with reference_task_ids pointing to the old one

Slack SDK Retry Handlers

  • Enable AsyncServerErrorRetryHandler (3 retries) — the SDK ships this but does not enable it by default, causing every transient HTTP 503 from Slack to hard-fail
  • Bump AsyncConnectionErrorRetryHandler to 3 retries (default is 1)
  • Add AsyncRateLimitErrorRetryHandler for HTTP 429

Full Message Extraction

  • New extract_full_message_text() merges content from event["text"], event["blocks"], and event["attachments"]
  • Without this, webhook integrations (HyperDX, PagerDuty) that put alert details in Block Kit have their data silently dropped
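The merge logic might look roughly like the sketch below. It is an assumption-laden simplification: real Block Kit payloads have more block and element types (rich_text, context, headers) than the section-style blocks handled here.

```python
def extract_full_message_text(event: dict) -> str:
    """Merge plain text, Block Kit text, and legacy attachments into one string.

    Sketch only: handles section-style blocks with a `text` object and
    `fields`; real payloads contain more block types.
    """
    parts: list[str] = []
    if event.get("text"):
        parts.append(event["text"])
    # Block Kit: section blocks carry a `text` object and optional `fields`
    for block in event.get("blocks", []):
        text = block.get("text")
        if isinstance(text, dict) and text.get("text"):
            parts.append(text["text"])
        for field in block.get("fields", []):
            if field.get("text"):
                parts.append(field["text"])
    # Legacy attachments (still used by many webhook integrations)
    for att in event.get("attachments", []):
        chunk = [att[k] for k in ("title", "text") if att.get(k)]
        if not chunk and att.get("fallback"):
            chunk = [att["fallback"]]
        parts.extend(chunk)
    return "\n".join(dict.fromkeys(parts))  # de-dupe while preserving order
```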

Graceful Shutdown (SIGTERM)

  • On SIGTERM, immediately disconnect Socket Mode so the dying pod stops receiving events
  • Without this, Slack keeps delivering events to the dying pod during rollouts; the pod starts processing, gets killed mid-A2A-call, and the user sees a 503
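The SIGTERM hook can be sketched as below. This is a sketch under assumptions: `socket_mode` stands in for whatever object owns the Socket Mode connection and exposes an async `disconnect()`; the PR's actual identifiers are not shown here.

```python
import asyncio
import signal

def install_sigterm_handler(loop: asyncio.AbstractEventLoop, socket_mode) -> None:
    """Disconnect Socket Mode as soon as SIGTERM arrives.

    `socket_mode` is a placeholder for the object holding the Socket Mode
    connection; it is assumed to expose an async disconnect().
    """
    def _on_sigterm() -> None:
        # Stop receiving new events first; in-flight A2A calls can then
        # finish before Kubernetes escalates to SIGKILL.
        loop.create_task(socket_mode.disconnect())

    loop.add_signal_handler(signal.SIGTERM, _on_sigterm)
```

`loop.add_signal_handler` is Unix-only, which matches the Kubernetes deployment target.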

Event Deduplication

  • Track event_ts to skip Socket Mode retries
  • Slack re-delivers events if envelope ACK takes >30s; A2A calls take minutes → every event was processed twice without this
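The dedup guard amounts to a membership check on `event_ts` before processing. A minimal sketch (the commit message names the set `_processed_events`; the bounded eviction here is an added assumption to keep memory flat on long-running pods):

```python
from collections import OrderedDict

# Sketch of the event_ts dedup guard. An OrderedDict doubles as an
# insertion-ordered set so the oldest entries can be evicted.
_processed_events: OrderedDict[str, None] = OrderedDict()
_MAX_TRACKED = 10_000  # illustrative bound, not from the PR

def seen_before(event_ts: str) -> bool:
    """Return True if this event_ts was already handled (a Socket Mode retry)."""
    if event_ts in _processed_events:
        return True
    _processed_events[event_ts] = None
    if len(_processed_events) > _MAX_TRACKED:
        _processed_events.popitem(last=False)  # evict the oldest entry
    return False
```

The handler still ACKs the envelope even when it skips a duplicate, so Slack stops retrying.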

Deployment Hardening

  • Dockerfile: ENV PYTHONUNBUFFERED=1 — without this, Python buffers stdout and kubectl logs shows nothing
  • k8s/deployment.yaml: strategy: Recreate prevents two pods competing for the same Socket Mode connection; added PYTHONUNBUFFERED env var
  • Reuse single httpx.AsyncClient for A2A calls (was leaking TCP pools)

Test plan

  • Alert fires → bot investigates → posts response in thread (no 503)
  • Reply in thread → bot continues same A2A task with history
  • Reply to completed task → falls back to new task with reference
  • Pod restart → no duplicate events, no zombie WebSocket processing
  • Logs visible immediately via kubectl logs (PYTHONUNBUFFERED)

🤖 Generated with Claude Code

…rdening

## Thread → A2A task mapping
- Slack threads now map to A2A tasks via in-memory dict (thread_key → task_id)
- Follow-up @mentions in the same thread continue the existing A2A task,
  giving the agent full conversation history
- When a completed task is continued (kagent returns 500 "terminal state"),
  the bot falls back to creating a new task with reference_task_ids

## Full message extraction
- New extract_full_message_text() merges content from event["text"],
  event["blocks"], and event["attachments"]
- Critical for webhook integrations (HyperDX, PagerDuty) that embed alert
  details in Block Kit rather than plain text

## Slack SDK retry handlers
- Enable AsyncServerErrorRetryHandler (3 retries) for HTTP 500/503
- Bump AsyncConnectionErrorRetryHandler to 3 retries (default was 1)
- Add AsyncRateLimitErrorRetryHandler (2 retries) for HTTP 429
- The SDK ships these handlers but does NOT enable them by default

## Graceful shutdown
- SIGTERM handler immediately disconnects Socket Mode so the dying pod
  stops receiving events during rolling updates
- Prevents duplicate event processing from zombie WebSocket sessions

## Event deduplication
- Track event_ts in _processed_events set to skip Socket Mode retries
- Slack re-delivers events if the envelope ACK takes >30s; A2A calls
  take minutes, so every event was being processed twice

## Deployment hardening
- Dockerfile: add ENV PYTHONUNBUFFERED=1 (without this, Python buffers
  stdout and container logs appear empty)
- k8s/deployment.yaml: add Recreate strategy (prevents two pods competing
  for the same Socket Mode connection), add PYTHONUNBUFFERED env var

## httpx client reuse
- Single module-level httpx.AsyncClient for A2A calls (was creating a new
  one per request, leaking TCP connection pools)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>