feat: thread continuity, graceful shutdown, event dedup, and retry hardening #8
Open
braghettos wants to merge 1 commit into kagent-dev:main
…rdening

## Thread → A2A task mapping
- Slack threads now map to A2A tasks via in-memory dict (thread_key → task_id)
- Follow-up @mentions in the same thread continue the existing A2A task, giving the agent full conversation history
- When a completed task is continued (kagent returns 500 "terminal state"), the bot falls back to creating a new task with reference_task_ids

## Full message extraction
- New extract_full_message_text() merges content from event["text"], event["blocks"], and event["attachments"]
- Critical for webhook integrations (HyperDX, PagerDuty) that embed alert details in Block Kit rather than plain text

## Slack SDK retry handlers
- Enable AsyncServerErrorRetryHandler (3 retries) for HTTP 500/503
- Bump AsyncConnectionErrorRetryHandler to 3 retries (default was 1)
- Add AsyncRateLimitErrorRetryHandler (2 retries) for HTTP 429
- The SDK ships these handlers but does NOT enable them by default

## Graceful shutdown
- SIGTERM handler immediately disconnects Socket Mode so the dying pod stops receiving events during rolling updates
- Prevents duplicate event processing from zombie WebSocket sessions

## Event deduplication
- Track event_ts in _processed_events set to skip Socket Mode retries
- Slack re-delivers events if the envelope ACK takes >30s; A2A calls take minutes, so every event was being processed twice

## Deployment hardening
- Dockerfile: add ENV PYTHONUNBUFFERED=1 (without this, Python buffers stdout and container logs appear empty)
- k8s/deployment.yaml: add Recreate strategy (prevents two pods competing for the same Socket Mode connection), add PYTHONUNBUFFERED env var

## httpx client reuse
- Single module-level httpx.AsyncClient for A2A calls (was creating a new one per request, leaking TCP connection pools)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
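The event deduplication item in the commit message can be illustrated with a small stdlib-only sketch. This is not the PR's code: the class name `EventDeduplicator`, the `max_entries` bound, and the fallback to `ts` are assumptions added for illustration. The commit message only states that `event_ts` values are tracked in a `_processed_events` set.

```python
from collections import OrderedDict


class EventDeduplicator:
    """Remembers recently seen event_ts values so redelivered events can be skipped."""

    def __init__(self, max_entries=10_000):
        # max_entries is an assumed bound so the memory of seen events cannot
        # grow forever; the commit message itself only mentions a plain set.
        self._seen = OrderedDict()  # event_ts -> None, oldest first
        self._max_entries = max_entries

    def is_duplicate(self, event):
        """Return True if this event was already seen; otherwise record it."""
        # Falling back to "ts" is an assumption for events without event_ts.
        event_ts = event.get("event_ts") or event.get("ts")
        if event_ts is None:
            return False  # nothing to key on, so let the event through
        if event_ts in self._seen:
            return True
        self._seen[event_ts] = None
        if len(self._seen) > self._max_entries:
            self._seen.popitem(last=False)  # evict the oldest entry
        return False
```

A handler would call `is_duplicate(event)` first and return early on `True`, before starting the long-running A2A call.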
## Summary
Production-hardened the Slack bot based on real-world debugging of a Krateo observability pipeline (HyperDX alerts → Slack → KAgent autopilot → ClickHouse investigation → response).
Every fix addresses an issue encountered in production:
## Thread → A2A Task Continuity
- `@bot` replies in the same Slack thread now continue the same A2A task, giving the agent full conversation history
- In-memory `thread_key → task_id` mapping with 24h TTL
- When a completed task is continued, the bot falls back to a new task with `reference_task_ids` pointing to the old one

## Slack SDK Retry Handlers
- Enabled `AsyncServerErrorRetryHandler` (3 retries): the SDK ships this but does not enable it by default, causing every transient HTTP 503 from Slack to hard-fail
- Bumped `AsyncConnectionErrorRetryHandler` to 3 retries (default is 1)
- Added `AsyncRateLimitErrorRetryHandler` for HTTP 429

## Full Message Extraction
- `extract_full_message_text()` merges content from `event["text"]`, `event["blocks"]`, and `event["attachments"]`

## Graceful Shutdown (SIGTERM)
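As a rough illustration of the SIGTERM behaviour described in the commit message (disconnect Socket Mode as soon as the pod is told to stop), here is a minimal asyncio sketch. The function name `run_until_sigterm` and the `disconnect` parameter are hypothetical; `disconnect` stands in for whatever coroutine closes the Socket Mode connection in the real bot, which this page does not show.

```python
import asyncio
import signal


async def run_until_sigterm(disconnect):
    """Block until SIGTERM arrives, then await disconnect() before returning.

    `disconnect` is a hypothetical async callable; in the real bot it would
    be whatever closes the Socket Mode WebSocket.
    """
    loop = asyncio.get_running_loop()
    stop = asyncio.Event()
    # add_signal_handler is Unix-only, which matches a Linux container.
    loop.add_signal_handler(signal.SIGTERM, stop.set)
    try:
        await stop.wait()   # normal operation; events are handled elsewhere
        await disconnect()  # stop receiving new events before exiting
    finally:
        loop.remove_signal_handler(signal.SIGTERM)
```

Disconnecting first means the old pod stops receiving events during a rollout, while in-flight work can still be allowed to finish afterwards if the surrounding code waits for it.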
## Event Deduplication
- Track `event_ts` to skip Socket Mode retries

## Deployment Hardening
- Dockerfile: `ENV PYTHONUNBUFFERED=1`; without this, Python buffers stdout and `kubectl logs` shows nothing
- k8s/deployment.yaml: `strategy: Recreate` prevents two pods competing for the same Socket Mode connection; added `PYTHONUNBUFFERED` env var
- Single module-level `httpx.AsyncClient` for A2A calls (was leaking TCP pools)

## Test plan
- `kubectl logs` (PYTHONUNBUFFERED)

🤖 Generated with Claude Code
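The thread → task mapping with a 24h TTL mentioned in the summary could look roughly like the sketch below. It is an illustration under stated assumptions, not the PR's implementation: the class `ThreadTaskMap`, the helper `make_thread_key`, the `channel:thread_ts` key format, and the injectable `clock` are all hypothetical.

```python
import time


class ThreadTaskMap:
    """In-memory thread_key -> task_id mapping whose entries expire after a TTL."""

    def __init__(self, ttl_seconds=24 * 60 * 60, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock   # injectable so expiry can be tested without waiting
        self._entries = {}    # thread_key -> (task_id, stored_at)

    def set(self, thread_key, task_id):
        self._entries[thread_key] = (task_id, self._clock())

    def get(self, thread_key):
        """Return the task_id for this thread, or None if absent or expired."""
        entry = self._entries.get(thread_key)
        if entry is None:
            return None
        task_id, stored_at = entry
        if self._clock() - stored_at > self._ttl:
            del self._entries[thread_key]  # lazily drop the expired entry
            return None
        return task_id


def make_thread_key(channel, thread_ts):
    # Assumed key format; the PR does not say how thread_key is built.
    return f"{channel}:{thread_ts}"
```

A lookup that returns `None` means the bot starts a new A2A task. Because the mapping lives in process memory, it is lost on restart, which is consistent with the fallback to a new task described above.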