feat: thread continuity, graceful shutdown, event dedup, and retry hardening#8

Open
braghettos wants to merge 1 commit into kagent-dev:main from braghettos:fix/thread-continuity-graceful-shutdown

Conversation

@braghettos

Summary

This PR production-hardens the Slack bot based on real-world debugging of a Krateo observability pipeline (HyperDX alerts → Slack → KAgent autopilot → ClickHouse investigation → response).

Every fix addresses an issue encountered in production:

Thread → A2A Task Continuity

  • Follow-up @bot replies in the same Slack thread now continue the same A2A task, giving the agent full conversation history
  • In-memory thread_key → task_id mapping with 24h TTL
  • Graceful fallback: when a completed task can't accept new messages (kagent returns 500 "terminal state"), the bot creates a new task with reference_task_ids pointing to the old one

Slack SDK Retry Handlers

  • Enable AsyncServerErrorRetryHandler (3 retries) — the SDK ships this but does not enable it by default, causing every transient HTTP 503 from Slack to hard-fail
  • Bump AsyncConnectionErrorRetryHandler to 3 retries (default is 1)
  • Add AsyncRateLimitErrorRetryHandler for HTTP 429

Full Message Extraction

  • New extract_full_message_text() merges content from event["text"], event["blocks"], and event["attachments"]
  • Without this, webhook integrations (HyperDX, PagerDuty) that put alert details in Block Kit have their data silently dropped
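The merge logic might look roughly like the sketch below. It is an assumption-laden simplification: real Block Kit payloads have more block and element types (rich_text, context, headers) than the section-style blocks handled here.

```python
def extract_full_message_text(event: dict) -> str:
    """Merge plain text, Block Kit text, and legacy attachments into one string.

    Sketch only: handles section-style blocks with a `text` object and
    `fields`; real payloads contain more block types.
    """
    parts: list[str] = []
    if event.get("text"):
        parts.append(event["text"])
    # Block Kit: section blocks carry a `text` object and optional `fields`
    for block in event.get("blocks", []):
        text = block.get("text")
        if isinstance(text, dict) and text.get("text"):
            parts.append(text["text"])
        for field in block.get("fields", []):
            if field.get("text"):
                parts.append(field["text"])
    # Legacy attachments (still used by many webhook integrations)
    for att in event.get("attachments", []):
        chunk = [att[k] for k in ("title", "text") if att.get(k)]
        if not chunk and att.get("fallback"):
            chunk = [att["fallback"]]
        parts.extend(chunk)
    return "\n".join(dict.fromkeys(parts))  # de-dupe while preserving order
```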

Graceful Shutdown (SIGTERM)

  • On SIGTERM, immediately disconnect Socket Mode so the dying pod stops receiving events
  • Without this, Slack keeps delivering events to the dying pod during rollouts; the pod starts processing, gets killed mid-A2A-call, and the user sees a 503
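The SIGTERM hook can be sketched as below. This is a sketch under assumptions: `socket_mode` stands in for whatever object owns the Socket Mode connection and exposes an async `disconnect()`; the PR's actual identifiers are not shown here.

```python
import asyncio
import signal

def install_sigterm_handler(loop: asyncio.AbstractEventLoop, socket_mode) -> None:
    """Disconnect Socket Mode as soon as SIGTERM arrives.

    `socket_mode` is a placeholder for the object holding the Socket Mode
    connection; it is assumed to expose an async disconnect().
    """
    def _on_sigterm() -> None:
        # Stop receiving new events first; in-flight A2A calls can then
        # finish before Kubernetes escalates to SIGKILL.
        loop.create_task(socket_mode.disconnect())

    loop.add_signal_handler(signal.SIGTERM, _on_sigterm)
```

`loop.add_signal_handler` is Unix-only, which matches the Kubernetes deployment target.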

Event Deduplication

  • Track event_ts to skip Socket Mode retries
  • Slack re-delivers events if envelope ACK takes >30s; A2A calls take minutes → every event was processed twice without this
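The dedup guard amounts to a membership check on `event_ts` before processing. A minimal sketch (the commit message names the set `_processed_events`; the bounded eviction here is an added assumption to keep memory flat on long-running pods):

```python
from collections import OrderedDict

# Sketch of the event_ts dedup guard. An OrderedDict doubles as an
# insertion-ordered set so the oldest entries can be evicted.
_processed_events: OrderedDict[str, None] = OrderedDict()
_MAX_TRACKED = 10_000  # illustrative bound, not from the PR

def seen_before(event_ts: str) -> bool:
    """Return True if this event_ts was already handled (a Socket Mode retry)."""
    if event_ts in _processed_events:
        return True
    _processed_events[event_ts] = None
    if len(_processed_events) > _MAX_TRACKED:
        _processed_events.popitem(last=False)  # evict the oldest entry
    return False
```

The handler still ACKs the envelope even when it skips a duplicate, so Slack stops retrying.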

Deployment Hardening

  • Dockerfile: ENV PYTHONUNBUFFERED=1 — without this, Python buffers stdout and kubectl logs shows nothing
  • k8s/deployment.yaml: strategy: Recreate prevents two pods competing for the same Socket Mode connection; added PYTHONUNBUFFERED env var
  • Reuse single httpx.AsyncClient for A2A calls (was leaking TCP pools)

Test plan

  • Alert fires → bot investigates → posts response in thread (no 503)
  • Reply in thread → bot continues same A2A task with history
  • Reply to completed task → falls back to new task with reference
  • Pod restart → no duplicate events, no zombie WebSocket processing
  • Logs visible immediately via kubectl logs (PYTHONUNBUFFERED)

🤖 Generated with Claude Code

…rdening

## Thread → A2A task mapping
- Slack threads now map to A2A tasks via in-memory dict (thread_key → task_id)
- Follow-up @mentions in the same thread continue the existing A2A task,
  giving the agent full conversation history
- When a completed task is continued (kagent returns 500 "terminal state"),
  the bot falls back to creating a new task with reference_task_ids

## Full message extraction
- New extract_full_message_text() merges content from event["text"],
  event["blocks"], and event["attachments"]
- Critical for webhook integrations (HyperDX, PagerDuty) that embed alert
  details in Block Kit rather than plain text

## Slack SDK retry handlers
- Enable AsyncServerErrorRetryHandler (3 retries) for HTTP 500/503
- Bump AsyncConnectionErrorRetryHandler to 3 retries (default was 1)
- Add AsyncRateLimitErrorRetryHandler (2 retries) for HTTP 429
- The SDK ships these handlers but does NOT enable them by default

## Graceful shutdown
- SIGTERM handler immediately disconnects Socket Mode so the dying pod
  stops receiving events during rolling updates
- Prevents duplicate event processing from zombie WebSocket sessions

## Event deduplication
- Track event_ts in _processed_events set to skip Socket Mode retries
- Slack re-delivers events if the envelope ACK takes >30s; A2A calls
  take minutes, so every event was being processed twice

## Deployment hardening
- Dockerfile: add ENV PYTHONUNBUFFERED=1 (without this, Python buffers
  stdout and container logs appear empty)
- k8s/deployment.yaml: add Recreate strategy (prevents two pods competing
  for the same Socket Mode connection), add PYTHONUNBUFFERED env var

## httpx client reuse
- Single module-level httpx.AsyncClient for A2A calls (was creating a new
  one per request, leaking TCP connection pools)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>