You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Deferred from #18. With #18, each councillor gets a single contiguous budget (councillor_ms, default 270s) and no retry. This is the right fix for the dominant failure mode (slow-but-working models), but it removes the legitimate recovery case: genuine transient infra errors (HTTP 5xx, 429, ECONNRESET, socket hang-ups) that would have succeeded on a second attempt.
Goal
Within the existing councillor_ms budget, retry only on classifiable transient errors. Never retry on timeout, never retry on permanent errors (auth, 4xx-not-429, validation).
Prerequisite work
The current error pipeline collapses everything to strings before runCouncillor's catch sees it:
src/session.tscreateChildSession: Error(failed to create child session: ${error}) — template interpolation, so object errors become [object Object].
src/session.tspromptAndExtract prompt call: Error(prompt failed: ${JSON.stringify(error)}) — structure preserved but only as a string.
src/session.tspromptAndExtract messages call: same [object Object] problem.
src/timeout.tsraceWithTimeout: Error( timed out after Xs) — string-only, intentionally distinguishable.
#18 includes a small cleanup: switch the two ${error} interpolations to JSON.stringify(error) so logs surface real structure. This issue should build on that and introduce typed errors:
Throw the typed error from session.ts so runCouncillor can inspect error.sdkError instead of regex-matching messages.
Proposed classifier
typeErrorClass="timeout"|"transient"|"permanent"|"unknown";functionclassify(err: unknown): ErrorClass{// timeout → message matches /timed out after/ → never retry// transient → HTTP 5xx, 429, ECONNRESET, ETIMEDOUT, socket hang up → retry once with short backoff// permanent → HTTP 4xx (not 429), validation, auth → never retry// unknown → anything else → don't retry (conservative)}
Behavior
Single councillor_ms budget governs total time across attempts.
On error, classify. If transient and remaining budget > min_retry_budget_ms (e.g., 30s), sleep backoffMs (e.g., 1-2s), then re-attempt within remaining budget.
Track elapsed time so the retry never overruns councillor_ms.
Cap at one retry. Never recurse.
Scope
src/session.ts: introduce CouncilSDKError; throw typed errors instead of stringified.
src/councillor.ts: classifier function; budget-aware retry on transient only.
What's the source of HTTP status codes from @opencode-ai/plugin errors? Needs investigation against the actual SDK error shape — without that, the classifier is guessing.
Backoff: fixed (e.g., 1s) or jittered? Single retry probably doesn't need jitter.
Should unknown errors retry once (optimistic) or never (conservative)? Recommend conservative — unknown means we couldn't classify, so we shouldn't burn budget on a guess.
Sequencing
Recommended: ship #18 first, collect a week of real-world council logs (with the JSON.stringify cleanup making error shapes visible), then design this classifier informed by actual transient error shapes observed.
Context
Deferred from #18. With #18, each councillor gets a single contiguous budget (
councillor_ms, default 270s) and no retry. This is the right fix for the dominant failure mode (slow-but-working models), but it removes the legitimate recovery case: genuine transient infra errors (HTTP 5xx, 429,ECONNRESET, socket hang-ups) that would have succeeded on a second attempt.Goal
Within the existing
councillor_msbudget, retry only on classifiable transient errors. Never retry on timeout, never retry on permanent errors (auth, 4xx-not-429, validation).Prerequisite work
The current error pipeline collapses everything to strings before
runCouncillor's catch sees it:src/session.tscreateChildSession:Error(failed to create child session: ${error})— template interpolation, so object errors become[object Object].src/session.tspromptAndExtractprompt call:Error(prompt failed: ${JSON.stringify(error)})— structure preserved but only as a string.src/session.tspromptAndExtractmessages call: same[object Object]problem.src/timeout.tsraceWithTimeout:Error(timed out after Xs)— string-only, intentionally distinguishable.#18 includes a small cleanup: switch the two
${error}interpolations toJSON.stringify(error)so logs surface real structure. This issue should build on that and introduce typed errors:Throw the typed error from session.ts so
runCouncillorcan inspecterror.sdkErrorinstead of regex-matching messages.Proposed classifier
Behavior
councillor_msbudget governs total time across attempts.transientand remaining budget >min_retry_budget_ms(e.g., 30s), sleepbackoffMs(e.g., 1-2s), then re-attempt within remaining budget.councillor_ms.Scope
src/session.ts: introduceCouncilSDKError; throw typed errors instead of stringified.src/councillor.ts: classifier function; budget-aware retry on transient only.src/types.ts: re-introduceCouncillorSuccess.attempts(removed in feat: collapse councillor retry into single extended budget #18) if we want to track this in the aggregator participation summary. Optional — could also be logged-only.src/timeout.ts:raceWithTimeoutmay need to expose remaining-budget info to the caller, orrunCouncillortracks elapsed manually.Open design questions
@opencode-ai/pluginerrors? Needs investigation against the actual SDK error shape — without that, the classifier is guessing.attempts: numberfromCouncillorSuccess.)unknownerrors retry once (optimistic) or never (conservative)? Recommend conservative —unknownmeans we couldn't classify, so we shouldn't burn budget on a guess.Sequencing
Recommended: ship #18 first, collect a week of real-world council logs (with the
JSON.stringifycleanup making error shapes visible), then design this classifier informed by actual transient error shapes observed.Related