Retry Anthropic api_error (Internal Server Error) instead of breaking the loop (Fixes #2053) #2063
… the loop (Fixes #2053)

Anthropic occasionally returns a body-level api_error ({"type":"error","error":{"type":"api_error","message":"Internal server error"},...}) that carries no HTTP status code. These errors broke the agent loop and also caused context compression to fail permanently, because the retry layer never classified them as transient.

isOverloadError() is the single chokepoint used by the main request retry path (RetryOrchestrator), the generic retryWithBackoff classifier, and compression's isTransientCompressionError(). Extend it to treat api_error as retryable alongside overloaded_error and rate_limit_error.

The Anthropic SDK wraps stream "error" events as new APIError(undefined, body, ...), storing the whole body on its .error property, so the meaningful type is nested at error.error.error.type while the intermediate error.error.type is the generic "error" envelope. Resolve the type from the deepest available position (error.error.error.type ?? error.error.type ?? error.type) so both the raw body shape and the real SDK-wrapped shape are detected. Deterministic types (invalid_request_error, authentication_error, etc.) and a bare {type:"error"} envelope remain non-retryable.

Also wire isOverloadError() into isRetryableError(), the default shouldRetryOnError predicate for retryWithBackoff, so no-status Anthropic body errors retry on the default request path (used by baseLlmClient / clientLlmUtilities) and not only via bucket failover.

Tests:
- core retry.test: isOverloadError across raw and SDK-wrapped shapes for all three types, negative cases, and a behavioral retryWithBackoff default-predicate retry-then-succeed for SDK-wrapped api_error.
- providers RetryOrchestrator.test: behavioral retry-then-succeed on an SDK-wrapped api_error.
- agents compression-retry.test: isTransientCompressionError for raw and SDK-wrapped api_error, plus a performCompression retry-then-succeed.
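The deepest-position type resolution described in the commit message can be sketched roughly as follows. This is a minimal illustration, not the code in retry.ts: the names `asRecord`, `resolveBodyType`, `isRetryableBodyError`, and `RETRYABLE_BODY_TYPES` are hypothetical, and the sample objects only mimic the shapes described above (a plain object stands in for the SDK's APIError).

```typescript
// Hypothetical sketch of the classification described above; the real
// isOverloadError() may differ in structure and in the checks it performs.
const RETRYABLE_BODY_TYPES = new Set<string>([
  'overloaded_error',
  'rate_limit_error',
  'api_error',
]);

// Narrow an unknown value to a plain record so property access is safe.
function asRecord(value: unknown): Record<string, unknown> | undefined {
  return typeof value === 'object' && value !== null
    ? (value as Record<string, unknown>)
    : undefined;
}

// Resolve the type from the deepest available position:
// error.error.error.type ?? error.error.type ?? error.type
function resolveBodyType(error: unknown): string | undefined {
  const outer = asRecord(error);
  const middle = asRecord(outer?.error);
  const inner = asRecord(middle?.error);
  const type = inner?.type ?? middle?.type ?? outer?.type;
  return typeof type === 'string' ? type : undefined;
}

function isRetryableBodyError(error: unknown): boolean {
  const type = resolveBodyType(error);
  return type !== undefined && RETRYABLE_BODY_TYPES.has(type);
}

// Raw body shape: the meaningful type sits at error.error.type.
const rawBody = {
  type: 'error',
  error: { type: 'api_error', message: 'Internal server error' },
};

// SDK-wrapped shape: the whole body is stored on .error, so the
// meaningful type sits one level deeper, at error.error.error.type.
const sdkWrapped = { status: undefined, error: rawBody };

// Negative cases: a bare envelope and a deterministic error type.
const bareEnvelope = { type: 'error' };
const invalidRequest = {
  status: undefined,
  error: { type: 'error', error: { type: 'invalid_request_error', message: 'bad input' } },
};
```

Because the deepest position wins, the generic "error" envelope at the intermediate level is never mistaken for the real type when a nested type is present, and a bare envelope resolves to "error", which is not in the retryable set.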
LLxprt PR Review – PR #2063
Issue Alignment
Issue #2053 reported that Anthropic
File-level evidence:
Verdict: The fix is well-targeted and addresses the issue directly.
Side Effects
Code Quality
Correctness:
Error Handling:
Maintainability:
Warnings (4 warnings, 0 errors):
Tests and Coverage
Coverage impact: Increase
New/modified tests across all 4 changed files:
Test quality: The tests cover the real error shape from issue #2053 (SDK-wrapped
All 123 tests pass.
Verdict
Ready
The PR correctly addresses issue #2053 by adding
Code Coverage Summary
CLI Package - Full Text Report
Core Package - Full Text Report
For detailed HTML reports, please see the 'coverage-reports-24.x-ubuntu-latest' artifact from the main CI run.
Summary
Fixes #2053.
Anthropic occasionally returns a body-level api_error (HTTP-statusless), for example {"type":"error","error":{"type":"api_error","message":"Internal server error"},...}.

Because this error carries no HTTP status code, the retry layer never classified it as transient. As a result, an Anthropic "Internal server error" broke the agent loop and also caused context compression to fail permanently, instead of being retried like a 429 or other intermittent error.
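To make the failure mode concrete, here is a small sketch of a classifier that looks only at the HTTP status. It is an assumption for illustration, not the project's actual predicate: `isRetryableByStatus` and the sample objects are hypothetical. It shows why an error with no status is never treated as transient by status checks alone.

```typescript
// Hypothetical status-only classifier, for illustration only; it is not
// the project's retry predicate.
interface ErrorWithStatus {
  status?: number;
}

function isRetryableByStatus(error: unknown): boolean {
  if (typeof error !== 'object' || error === null) return false;
  const status = (error as ErrorWithStatus).status;
  // 429 (rate limited) and 5xx (server-side) are the usual transient statuses.
  return (
    typeof status === 'number' &&
    (status === 429 || (status >= 500 && status < 600))
  );
}

// An ordinary rate-limit failure carries a status and is retried.
const http429 = { status: 429 };

// The body-level api_error carries no status at all, so a status-only
// check has nothing to match on, even though the failure is transient.
const statuslessApiError = {
  status: undefined,
  error: {
    type: 'error',
    error: { type: 'api_error', message: 'Internal server error' },
  },
};
```

With a check like this, `http429` is retried while `statuslessApiError` is not, which matches the symptom reported in the issue: the only signal that the failure is transient lives in the body's type field.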
Root cause
isOverloadError() (in packages/core/src/utils/retry.ts) is the single chokepoint used by:
- the main request retry path (RetryOrchestrator.shouldRetryError),
- the generic retryWithBackoff classifier, and
- compression's isTransientCompressionError().

It only recognized overloaded_error and rate_limit_error, so api_error fell through every retry path.

There is also a subtlety in the real error shape. The Anthropic SDK throws stream error events as new APIError(undefined, body, ...) and stores the entire body on the error's .error property. So the production error object has no status and carries the whole {"type":"error","error":{"type":"api_error",...}} body under .error.

The meaningful, retryable type is nested at error.error.error.type; the intermediate error.error.type is just the generic "error" envelope. The raw (un-wrapped) body shape instead exposes the type at error.error.type.

Changes
- isOverloadError(): now resolves the body type from the deepest available position (error.error.error.type ?? error.error.type ?? error.type) and treats api_error as retryable alongside overloaded_error and rate_limit_error. This detects both the raw body shape and the real SDK-wrapped shape. Deterministic Anthropic types (invalid_request_error, authentication_error, etc.) and a bare {type:"error"} envelope remain non-retryable.
- isRetryableError(): the default shouldRetryOnError predicate for retryWithBackoff now consults isOverloadError(), so no-status Anthropic body errors retry on the default request path (used by baseLlmClient / clientLlmUtilities and StreamProcessor), not only via bucket failover.

Because all three retry consumers delegate to isOverloadError(), this single change closes both the loop-break path and the compression-failure path described in the issue.

Tests (behavioral, RED-verified before the fix)
- retry.test.ts: isOverloadError across raw and SDK-wrapped shapes for all three types, negative cases (deterministic types + bare envelope + null/undefined/string), and a behavioral retryWithBackoff default-predicate retry-then-succeed for an SDK-wrapped api_error.
- RetryOrchestrator.test.ts: behavioral retry-then-succeed on an SDK-wrapped api_error.
- compression-retry.test.ts: isTransientCompressionError for raw and SDK-wrapped api_error, plus a performCompression retry-then-succeed.

Verification