
fix: enable SDK-level retry for Slack 503s and extract full message content #7

Open

braghettos wants to merge 1 commit into kagent-dev:main from braghettos:fix/slack-503-retry-and-message-extraction

Conversation

@braghettos

Summary

  • Enable AsyncServerErrorRetryHandler on the AsyncWebClient — the Slack SDK ships this handler but does not enable it by default, causing every transient HTTP 503 from Slack to hard-fail with SlackApiError
  • Add AsyncRateLimitErrorRetryHandler for HTTP 429 and bump AsyncConnectionErrorRetryHandler to 3 retries (default is 1)
  • Add extract_full_message_text() to read full event content from blocks and attachments, not just the plain-text event["text"] fallback — critical for webhook integrations (e.g. HyperDX, PagerDuty) that embed alert details in Block Kit
  • Reuse a single httpx.AsyncClient for A2A calls instead of creating one per request (prevents TCP connection pool leak)

Problem

When Slack returns a transient 503, the SDK's default AsyncConnectionErrorRetryHandler does not catch it because 503 is an HTTP-level status code, not a TCP connection exception (ServerConnectionError, ClientOSError). The SDK's built-in AsyncServerErrorRetryHandler handles exactly this case (retries on 500/503) but is not included in async_default_handlers().

This causes intermittent failures like:

Error: HTTP Error 503: Network communication error: All connection attempts failed
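The gap is easy to see in a minimal sketch (illustrative only, not the SDK's implementation): a connection-error handler only fires when the transport raises an exception, so a successfully delivered HTTP response carrying status 503 sails straight through unless something inspects the status code and re-issues the call.

```python
import asyncio

# Illustrative sketch of what a server-error retry handler does. A handler
# that only catches transport exceptions (ConnectionError etc.) never sees
# these responses, because the HTTP exchange itself succeeded.
RETRYABLE_STATUSES = {500, 503}

async def call_with_server_error_retry(send, max_retry_count=3, backoff=0.0):
    """Re-issue `send()` while its *response* carries a retryable status."""
    attempt = 0
    while True:
        status, body = await send()
        if status not in RETRYABLE_STATUSES or attempt >= max_retry_count:
            return status, body
        attempt += 1
        await asyncio.sleep(backoff * attempt)  # back off between attempts

# Fake transport: returns 503 twice, then succeeds.
responses = iter([(503, ""), (503, ""), (200, "ok")])

async def fake_send():
    return next(responses)

status, body = asyncio.run(call_with_server_error_retry(fake_send))
print(status, body)  # 200 ok
```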

Changes

main.py

  • Construct AsyncWebClient with explicit retry handlers and inject into AsyncApp
  • AsyncServerErrorRetryHandler(max_retry_count=3) — HTTP 500/503
  • AsyncConnectionErrorRetryHandler(max_retry_count=3) — TCP failures
  • AsyncRateLimitErrorRetryHandler(max_retry_count=2) — HTTP 429
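The wiring in main.py looks roughly like this (a sketch based on the handler names above; import paths follow slack_sdk/slack_bolt conventions and may need adjusting for your SDK version):

```python
import os

from slack_bolt.async_app import AsyncApp
from slack_sdk.web.async_client import AsyncWebClient
from slack_sdk.http_retry.builtin_async_handlers import (
    AsyncConnectionErrorRetryHandler,
    AsyncRateLimitErrorRetryHandler,
    AsyncServerErrorRetryHandler,
)

# Build the client with explicit retry handlers instead of relying on
# async_default_handlers(), which omits the server-error handler.
client = AsyncWebClient(
    token=os.environ["SLACK_BOT_TOKEN"],
    retry_handlers=[
        AsyncServerErrorRetryHandler(max_retry_count=3),      # HTTP 500/503
        AsyncConnectionErrorRetryHandler(max_retry_count=3),  # TCP failures
        AsyncRateLimitErrorRetryHandler(max_retry_count=2),   # HTTP 429
    ],
)

# Inject the configured client so Bolt uses it for all API calls.
app = AsyncApp(client=client)
```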

handlers.py

  • New extract_full_message_text() — merges content from event["text"], event["blocks"], and event["attachments"]
  • handle_app_mention now passes the full alert context to the A2A agent
  • Module-level httpx.AsyncClient reuse (was creating a new one per invocation)
  • Code cleanup: type hints, module-level logger, removal of unused imports
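A plausible shape for extract_full_message_text() (an illustrative sketch, not the PR's exact code; field names follow Slack's event payload format, and the real handler may walk more block types):

```python
def extract_full_message_text(event: dict) -> str:
    """Merge text from event["text"], Block Kit blocks, and legacy attachments."""
    parts: list[str] = []

    # 1. Plain-text fallback that Slack includes on most message events.
    if event.get("text"):
        parts.append(event["text"])

    # 2. Block Kit content: section text plus field lists.
    for block in event.get("blocks", []):
        text = block.get("text")
        if isinstance(text, dict) and text.get("text"):
            parts.append(text["text"])
        for field in block.get("fields", []):
            if field.get("text"):
                parts.append(field["text"])

    # 3. Legacy attachments (many webhook integrations still use these);
    #    use "fallback" only when no richer "text" is present.
    for att in event.get("attachments", []):
        att_text = att.get("text") or att.get("fallback")
        for piece in (att.get("pretext"), att.get("title"), att_text):
            if piece:
                parts.append(piece)

    # De-duplicate while preserving order.
    seen: set[str] = set()
    merged = [p for p in parts if not (p in seen or seen.add(p))]
    return "\n".join(merged)


event = {
    "text": "<@U123> alert fired",
    "blocks": [{"type": "section",
                "text": {"type": "mrkdwn", "text": "CPU > 90% on web-1"}}],
    "attachments": [{"title": "HyperDX alert", "text": "threshold breached"}],
}
print(extract_full_message_text(event))
```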

Test plan

  • Verified all 3 retry handlers are active in the running pod
  • Confirmed extract_full_message_text() correctly extracts HyperDX webhook alert content from blocks/attachments
  • Tested outbound Slack API connectivity with 10 rapid calls (all 200 OK)
  • Monitor for 503 errors over the next 24h — they should be silently retried by the SDK

🤖 Generated with Claude Code


## Problem

The Slack SDK's default retry configuration only handles TCP connection
errors (AsyncConnectionErrorRetryHandler, 1 retry). HTTP 503 responses
— which Slack returns transiently under load — are treated as final
results and immediately raise SlackApiError. This causes chat_postMessage
to fail intermittently with:

  Error: HTTP Error 503: Network communication error:
  All connection attempts failed

Additionally, the handle_app_mention handler only reads event["text"]
(the plain-text fallback), missing rich content from blocks and
attachments that webhook integrations like HyperDX include.

## Root cause

The SDK ships AsyncServerErrorRetryHandler (retries HTTP 500/503) but
does NOT enable it by default. The existing
AsyncConnectionErrorRetryHandler only catches aiohttp connection
exceptions (ServerConnectionError, ClientOSError), not HTTP status codes.

## Fix

**main.py:**
- Create AsyncWebClient with three explicit retry handlers:
  - AsyncServerErrorRetryHandler (3 retries) — HTTP 500/503
  - AsyncConnectionErrorRetryHandler (3 retries) — TCP failures
  - AsyncRateLimitErrorRetryHandler (2 retries) — HTTP 429
- Inject the configured client into AsyncApp

**handlers.py:**
- Add extract_full_message_text() to merge text from event["text"],
  event["blocks"], and event["attachments"] — gives downstream agents
  the complete alert context
- Reuse a single httpx.AsyncClient for A2A calls (prevents connection
  pool leak from creating a new client per request)
- Clean up code structure, use module-level logger, add type hints

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
