Problem
Transient network errors (e.g. httpx.ConnectError) during Claude API calls immediately fail to the user with a generic error message. There is no retry logic at any layer:
sdk_integration.py:execute_command() — catches Exception but re-raises immediately
facade.py:run_command() — only retries on "no conversation found" (stale session), not network errors
orchestrator.py:agentic_text() — catches, formats, and sends error to user
Observed behavior
A momentary network blip causes:
which surfaces to the Telegram user as a generic error, even though retrying 2-3 seconds later would succeed.
Proposed solution
Add retry with exponential backoff in sdk_integration.py:execute_command() for transient/retryable errors:
httpx.ConnectError, httpx.TimeoutException
- Possibly
asyncio.TimeoutError (currently raised as ClaudeTimeoutError with no retry)
Suggested approach:
- 2-3 retries with exponential backoff (e.g. 1s, 3s, 9s)
- Only retry on transport-level errors, not application errors (rate limits, auth failures, etc.)
- Consider adding
tenacity as a dependency, or implement a simple async retry loop
- Log each retry attempt at warning level
This is the right layer because execute_command() is the single chokepoint for all Claude API calls. Retrying higher up risks duplicating session side-effects.
Problem
Transient network errors (e.g.
httpx.ConnectError) during Claude API calls immediately fail to the user with a generic error message. There is no retry logic at any layer:sdk_integration.py:execute_command()— catchesExceptionbut re-raises immediatelyfacade.py:run_command()— only retries on"no conversation found"(stale session), not network errorsorchestrator.py:agentic_text()— catches, formats, and sends error to userObserved behavior
A momentary network blip causes:
which surfaces to the Telegram user as a generic error, even though retrying 2-3 seconds later would succeed.
Proposed solution
Add retry with exponential backoff in
sdk_integration.py:execute_command()for transient/retryable errors:httpx.ConnectError,httpx.TimeoutExceptionasyncio.TimeoutError(currently raised asClaudeTimeoutErrorwith no retry)Suggested approach:
tenacityas a dependency, or implement a simple async retry loopThis is the right layer because
execute_command()is the single chokepoint for all Claude API calls. Retrying higher up risks duplicating session side-effects.