Bound Claude call to a deadline under the API Gateway 30s limit by mattgodbolt-molty · Pull Request #19 · compiler-explorer/explain

mattgodbolt-molty · 2026-06-15T16:27:47Z

(I'm Molty, an AI assistant acting on behalf of @mattgodbolt)

Problem

Users report intermittent 503s from the /explain endpoint. The Lambda itself is completely healthy (Throttles=0, Errors=0) and the app only ever logs 200/307. The 503s come from API Gateway: its HTTP API integration timeout is a hard 30s ceiling that cannot be raised, but the Lambda timeout was set to 60s.

Requests whose generation runs past 30s are severed by the gateway with an opaque {"message":"Service Unavailable"} (33 bytes) while the Lambda keeps running to completion — billing tokens we never deliver.

Quantified impact (CloudWatch, 2026-06-15):

Today: 33/649 requests 503'd (5.1%)
Yesterday: 27/304 requests 503'd (8.9%)
Lambda duration max: 37.8s — well past the 30s ceiling

Fix

Bound each Claude call to a wall-clock budget (default 27s, configurable via ANTHROPIC_TIMEOUT_SECONDS) that sits comfortably below the gateway ceiling. On timeout or a transient upstream failure, return a structured, retryable error response (status="error", ClaudeExplainTimeout metric) instead of letting the gateway emit a 503.
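
As a rough illustration of the configurable budget, here is a minimal sketch that reads it from the environment. It assumes a plain `os.environ` lookup; the real `config.py` may well use a settings library instead. Only `ANTHROPIC_TIMEOUT_SECONDS`, the 27s default and the 30s ceiling come from the description above. `load_timeout_seconds`, the two constants, and the range check are hypothetical additions for illustration.

```python
import os

# Gateway integration timeout, as stated in the PR description.
GATEWAY_CEILING_SECONDS = 30.0
# Default budget from the PR description.
DEFAULT_TIMEOUT_SECONDS = 27.0


def load_timeout_seconds(env=None):
    """Return the Claude call budget in seconds (hypothetical helper).

    The range check is an assumption of this sketch, not something the PR
    is known to do: it rejects budgets at or above the gateway ceiling.
    """
    env = os.environ if env is None else env
    raw = env.get("ANTHROPIC_TIMEOUT_SECONDS")
    value = DEFAULT_TIMEOUT_SECONDS if raw is None else float(raw)
    if not 0 < value < GATEWAY_CEILING_SECONDS:
        raise ValueError(
            f"budget must be in (0, {GATEWAY_CEILING_SECONDS}) seconds, got {value}"
        )
    return value
```

Rejecting out-of-range values at startup would surface a misconfiguration immediately rather than as a return of the gateway 503s.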

Changes:

config.py: new anthropic_timeout_seconds (27.0) and anthropic_max_retries (2) settings
explain.py: wrap client.messages.create in asyncio.timeout(deadline_seconds), catch TimeoutError/APITimeoutError/APIConnectionError and transient HTTP errors (408/429/5xx/529), return a user-friendly error response
main.py: pass timeout/max_retries to AsyncAnthropic client and thread deadline_seconds through to process_request
test_explain.py: new test covering the deadline path

🤖 Generated by LLM (Claude, via OpenClaw)

Users report intermittent 503s from the explain endpoint. The Lambda itself is healthy (Throttles=0, Errors=0) and the app only ever logs 200/307. The 503s come from API Gateway: its HTTP API integration timeout is a hard 30s ceiling, but the Lambda timeout was raised to 60s. Requests whose generation runs past 30s (218 over the last 4 days, up to ~40s) are severed by the gateway with an opaque 503 while the Lambda keeps running to completion, billing tokens we never deliver.

Bound each Claude call to a wall-clock budget (default 27s, configurable via ANTHROPIC_TIMEOUT_SECONDS) that sits comfortably below the gateway ceiling, and set the SDK per-attempt timeout and retry budget to match. On timeout or a transient upstream failure, return a structured, retryable error response (status="error", ClaudeExplainTimeout metric) instead of letting the gateway emit a 503. Genuine client errors (e.g. a 400) still propagate as real failures.

🤖 Generated by LLM (Claude, via OpenClaw)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
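
The split between transient upstream failures and genuine client errors can be sketched as a small status-code check. This assumes the decision is made purely on the HTTP status, using the codes named in the PR description (408/429/5xx/529). `is_transient_status` is a hypothetical name, and the real code may classify errors by SDK exception type instead.

```python
def is_transient_status(status_code):
    """Return True for statuses treated as retryable upstream failures.

    408 (request timeout) and 429 (rate limited) are retryable client-side
    codes. Every 5xx is treated as transient, which already covers 529.
    """
    return status_code in (408, 429) or 500 <= status_code <= 599


# A 400 is a genuine client error and should propagate as a real failure;
# a 529 should become a structured, retryable error response.
propagates = not is_transient_status(400)
retryable = is_transient_status(529)
```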

mattgodbolt · 2026-06-15T16:31:13Z

Cc @dkm: I had this branch hanging around... I will be investigating other ways to speed this up. Well, Claude will.

Copilot

Pull request overview

This pull request mitigates intermittent API Gateway–originated 503s on the /explain endpoint by ensuring Claude generation work is bounded to a wall-clock budget below the gateway’s fixed 30s integration timeout, returning a structured error response instead of letting the gateway sever the connection.

Changes:

Added configurable Anthropic timeout and retry settings (anthropic_timeout_seconds, anthropic_max_retries) with a default 27s budget.
Wrapped client.messages.create(...) with an asyncio.timeout(...) wall-clock deadline and converted timeout/transient upstream failures into a structured status="error" response plus metrics.
Updated FastAPI app startup to configure the Anthropic client timeout/retries and threaded the deadline through request processing; added a test covering the deadline path.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

File	Description
app/config.py	Introduces Anthropic timeout/retry settings to keep requests under API Gateway limits.
app/explain.py	Enforces a wall-clock deadline around Claude calls and returns structured retryable errors on transient failures.
app/main.py	Configures `AsyncAnthropic` with timeout/retries and passes the deadline into `process_request`.
app/test_explain.py	Adds an async test ensuring overrunning the deadline returns a structured error response.


+    message_text = (
+        "Claude Explain could not generate an explanation in time. "
+        "This usually means the model was busy or the input was very large; "
+        "please try again in a moment."
+    )


Copilot review noted the original single message ('could not generate an explanation in time') was inaccurate for non-timeout failures like connection errors or transient 5xx responses. Now returns a timeout-specific message for TimeoutError/APITimeoutError and a generic 'temporarily unavailable' message for everything else. 🤖 Generated by LLM (Claude, via OpenClaw)

Copilot

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

+    if isinstance(error, (TimeoutError, APITimeoutError)):
+        message_text = (
+            "Claude Explain took too long to respond — the input may be very large "
+            "or the model is under heavy load. Please try again in a moment."
+        )
+    else:
+        message_text = "Claude Explain is temporarily unavailable. Please try again in a moment."
+    LOGGER.warning("Anthropic call failed (%s): %s", type(error).__name__, error)
+    metrics_provider.set_property("language", body.language)
+    metrics_provider.set_property("compiler", body.compiler)
+    metrics_provider.set_property("instructionSet", body.instructionSet or "unknown")
+    metrics_provider.set_property("cached", "false")
+    metrics_provider.put_metric("ClaudeExplainRequest", 1)
+    metrics_provider.put_metric("ClaudeExplainTimeout", 1)
+    return ExplainResponse(


Copilot review noted ClaudeExplainTimeout was being emitted for all transient failures (connection errors, 5xx, etc.), making it impossible to distinguish real timeout rate from upstream error rate in dashboards. Now emits ClaudeExplainTimeout only for TimeoutError/APITimeoutError, and ClaudeExplainTransientError for everything else. 🤖 Generated by LLM (Claude, via OpenClaw)
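
The revised metric selection described above might look like the following sketch. The diff shown earlier predates this change, so the code here is a reconstruction under assumptions rather than the merged implementation. `APITimeoutError` is redefined locally as a stand-in for the SDK's exception so the example has no dependencies, and `failure_metric_name` is a hypothetical helper name.

```python
class APITimeoutError(Exception):
    """Local stand-in for the SDK's APITimeoutError (illustration only)."""


def failure_metric_name(error):
    """Pick the metric to emit for a failed Claude call."""
    # Real timeouts get their own metric so dashboards can separate the
    # timeout rate from the upstream error rate.
    if isinstance(error, (TimeoutError, APITimeoutError)):
        return "ClaudeExplainTimeout"
    # Connection errors, transient 5xx responses and similar.
    return "ClaudeExplainTransientError"


timeout_metric = failure_metric_name(TimeoutError("deadline exceeded"))
other_metric = failure_metric_name(ConnectionError("connection reset"))
```

Checking for the timeout types first is deliberate. If the SDK's timeout exception subclasses its connection error, as I believe it does, testing for the connection error first would misfile timeouts as transient errors.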

Copilot

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated no new comments.

mattgodbolt requested a review from Copilot June 15, 2026 16:28

Copilot started reviewing on behalf of mattgodbolt June 15, 2026 16:29

Copilot AI reviewed Jun 15, 2026


Comment thread app/explain.py Outdated

Comment on lines +92 to +96


mattgodbolt requested a review from Copilot June 15, 2026 16:36

Copilot started reviewing on behalf of mattgodbolt June 15, 2026 16:36

Copilot AI reviewed Jun 15, 2026


mattgodbolt requested a review from Copilot June 15, 2026 16:40

Copilot started reviewing on behalf of mattgodbolt June 15, 2026 16:41

Copilot AI reviewed Jun 15, 2026


mattgodbolt merged commit 0d9ffa8 into main Jun 15, 2026
3 checks passed

mattgodbolt deleted the fix/explain-503-timeout branch June 15, 2026 16:55

mattgodbolt merged 3 commits into main from fix/explain-503-timeout
