feat: transient-error classifier for councillor in-budget retry #19

Description

@skwid138

Context

Deferred from #18. With #18, each councillor gets a single contiguous budget (councillor_ms, default 270s) and no retry. This is the right fix for the dominant failure mode (slow-but-working models), but it removes the legitimate recovery case: genuine transient infra errors (HTTP 5xx, 429, ECONNRESET, socket hang-ups) that would have succeeded on a second attempt.

Goal

Within the existing councillor_ms budget, retry only on classifiable transient errors. Never retry on timeout, never retry on permanent errors (auth, 4xx-not-429, validation).

Prerequisite work

The current error pipeline collapses everything to strings before runCouncillor's catch sees it:

  • src/session.ts createChildSession: Error(`failed to create child session: ${error}`) uses template interpolation, so object errors become [object Object].
  • src/session.ts promptAndExtract prompt call: Error(`prompt failed: ${JSON.stringify(error)}`) preserves structure, but only as a string.
  • src/session.ts promptAndExtract messages call: same [object Object] problem.
  • src/timeout.ts raceWithTimeout: Error( timed out after Xs) — string-only, intentionally distinguishable.
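
To make the collapse described in the list above concrete, here is a minimal sketch. The sdkError object and its status and message fields are hypothetical stand-ins, since the real SDK error shape has not been investigated yet.

```typescript
// Hypothetical error object; the real SDK error shape is an open question.
const sdkError = { status: 503, message: "upstream unavailable" };

// Template interpolation calls String(obj), which discards every field.
const interpolated = new Error(`failed to create child session: ${sdkError}`);

// JSON.stringify keeps the fields, but only as text that must be re-parsed.
const stringified = new Error(`prompt failed: ${JSON.stringify(sdkError)}`);

console.log(interpolated.message);
// failed to create child session: [object Object]
console.log(stringified.message);
// prompt failed: {"status":503,"message":"upstream unavailable"}

// Caveat: Error instances have no enumerable own properties, so
// JSON.stringify loses their message as well.
const fromErrorInstance = JSON.stringify(new Error("socket hang up"));
console.log(fromErrorInstance); // {}
```

The last line suggests that JSON.stringify alone may still hide errors that arrive as Error instances, which is one more reason to carry the original value on a typed error.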

#18 includes a small cleanup: switch the two ${error} interpolations to JSON.stringify(error) so logs surface real structure. This issue should build on that and introduce typed errors:

class CouncilSDKError extends Error {
  constructor(message: string, public readonly sdkError: unknown) { super(message); }
}

Throw the typed error from session.ts so runCouncillor can inspect error.sdkError instead of regex-matching messages.
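
A rough sketch of how the throw site and the inspection site could fit together. The SdkResult shape, the unwrap helper, and the status field are all assumptions made for illustration and are not confirmed against @opencode-ai/plugin. The class is restated with an explicit field so the example is self-contained.

```typescript
class CouncilSDKError extends Error {
  readonly sdkError: unknown;
  constructor(message: string, sdkError: unknown) {
    super(message);
    this.name = "CouncilSDKError";
    this.sdkError = sdkError;
  }
}

// ASSUMPTION: a hypothetical { data, error } result shape for SDK calls.
type SdkResult<T> = { data?: T; error?: unknown };

// Throw site (session.ts): keep the raw error object instead of stringifying it.
function unwrap<T>(result: SdkResult<T>, what: string): T {
  if (result.error !== undefined || result.data === undefined) {
    throw new CouncilSDKError(`${what} failed`, result.error);
  }
  return result.data;
}

// Inspection site (runCouncillor): read structure, not message text.
// ASSUMPTION: the raw error exposes a numeric `status` field.
function readStatus(err: unknown): number | undefined {
  if (!(err instanceof CouncilSDKError)) return undefined;
  const raw = err.sdkError;
  if (typeof raw === "object" && raw !== null && "status" in raw) {
    const status = (raw as { status: unknown }).status;
    return typeof status === "number" ? status : undefined;
  }
  return undefined;
}
```

Note that instanceof checks on Error subclasses only behave as expected when compiling to ES2015 or later.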

Proposed classifier

type ErrorClass = "timeout" | "transient" | "permanent" | "unknown";

function classify(err: unknown): ErrorClass {
  // timeout  → message matches /timed out after/  → never retry
  // transient → HTTP 5xx, 429, ECONNRESET, ETIMEDOUT, socket hang up → retry once with short backoff
  // permanent → HTTP 4xx (not 429), validation, auth → never retry
  // unknown  → anything else → don't retry (conservative)
}
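
One possible way to fill in the skeleton above, offered as a sketch only. It assumes the raw error exposes a numeric status and a Node-style string code, and that typed errors carry the raw value on sdkError; none of this is verified against the SDK, so the field names are placeholders until real error shapes are observed.

```typescript
type ErrorClass = "timeout" | "transient" | "permanent" | "unknown";

const TRANSIENT_CODES = new Set(["ECONNRESET", "ETIMEDOUT"]);

// Reads a property from an unknown value without assuming its shape.
function field(value: unknown, key: string): unknown {
  return typeof value === "object" && value !== null
    ? (value as Record<string, unknown>)[key]
    : undefined;
}

function classify(err: unknown): ErrorClass {
  // Timeouts from raceWithTimeout are identified by message, as proposed.
  const message = String(field(err, "message") ?? "");
  if (/timed out after/.test(message)) return "timeout";

  // ASSUMPTION: typed errors carry the raw SDK error on `sdkError`,
  // and the raw error exposes `status` (number) and `code` (string).
  const raw = field(err, "sdkError") ?? err;
  const status = field(raw, "status");
  const code = field(raw, "code");

  if (typeof status === "number") {
    if (status === 429 || (status >= 500 && status <= 599)) return "transient";
    if (status >= 400 && status <= 499) return "permanent";
  }
  if (typeof code === "string" && TRANSIENT_CODES.has(code)) return "transient";
  if (/socket hang up/i.test(String(field(raw, "message") ?? ""))) return "transient";

  // Anything unrecognized is left unclassified and therefore not retried.
  return "unknown";
}
```

In this sketch, validation and auth failures that arrive without a status code fall through to "unknown". That is still never retried, so the observable behavior matches "permanent".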

Behavior

  • Single councillor_ms budget governs total time across attempts.
  • On error, classify. If the class is transient and the remaining budget exceeds min_retry_budget_ms (e.g., 30s), sleep for backoffMs (e.g., 1-2s), then re-attempt within the remaining budget.
  • Track elapsed time so the retry never overruns councillor_ms.
  • Cap at one retry. Never recurse.
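
The behavior listed above could be sketched roughly as follows. The runWithBudget name, the RetryOptions shape, and the injectable now and sleep hooks are hypothetical. The sketch also assumes the attempt callback enforces the remaining budget it is given (for example via raceWithTimeout).

```typescript
type ErrorClass = "timeout" | "transient" | "permanent" | "unknown";

interface RetryOptions {
  councillorMs: number;     // total budget across all attempts
  minRetryBudgetMs: number; // e.g. 30000
  backoffMs: number;        // e.g. 1000
  classify: (err: unknown) => ErrorClass;
  now?: () => number;                     // injectable clock for tests
  sleep?: (ms: number) => Promise<void>;  // injectable sleep for tests
}

async function runWithBudget<T>(
  attempt: (remainingMs: number) => Promise<T>,
  opts: RetryOptions,
): Promise<{ value: T; attempts: number }> {
  const now = opts.now ?? (() => Date.now());
  const sleep =
    opts.sleep ?? ((ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms)));
  const deadline = now() + opts.councillorMs;

  try {
    return { value: await attempt(deadline - now()), attempts: 1 };
  } catch (err) {
    const remainingMs = deadline - now();
    // Retry only transient errors, and only if enough budget is left.
    if (opts.classify(err) !== "transient" || remainingMs <= opts.minRetryBudgetMs) {
      throw err;
    }
    await sleep(opts.backoffMs);
    // Second and final attempt: no loop, no recursion. Any error propagates.
    return { value: await attempt(deadline - now()), attempts: 2 };
  }
}
```

Because the deadline is computed once up front and the backoff is spent from the same budget, the second attempt can never be handed more time than councillor_ms allows in total.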

Scope

  • src/session.ts: introduce CouncilSDKError; throw typed errors instead of stringified.
  • src/councillor.ts: classifier function; budget-aware retry on transient only.
  • src/types.ts: re-introduce CouncillorSuccess.attempts (removed in #18) if we want to track this in the aggregator participation summary. Optional; this could also be logged-only.
  • src/timeout.ts: raceWithTimeout may need to expose remaining-budget info to the caller, or runCouncillor tracks elapsed manually.
  • Tests: per-class retry behavior, budget accounting, no-retry-after-timeout.

Open design questions

  • What's the source of HTTP status codes from @opencode-ai/plugin errors? Needs investigation against the actual SDK error shape — without that, the classifier is guessing.
  • Should retry count be surfaced in the aggregator prompt? (#18 dropped attempts: number from CouncillorSuccess.)
  • Backoff: fixed (e.g., 1s) or jittered? Single retry probably doesn't need jitter.
  • Should unknown errors retry once (optimistic) or never (conservative)? Recommend conservative — unknown means we couldn't classify, so we shouldn't burn budget on a guess.

Sequencing

Recommended: ship #18 first, collect a week of real-world council logs (with the JSON.stringify cleanup making error shapes visible), then design this classifier informed by actual transient error shapes observed.
