Problem
The runtime live-scenario CI lanes launch a real headless host agent that drives a multi-stage workflow, guarded by a no-progress stall-watchdog (it kills the run if the host emits no stream output for a fixed budget, reporting "made no stream progress within {budget} — a hung stage").
When the provider API is overloaded (HTTP 529 "overloaded", 5xx, rate-limiting), the host SDK backs off and retries, emitting no stream output during the backoff. A backoff that exceeds the watchdog budget trips the watchdog, which kills the run and labels it a hung stage. This mislabels a transient provider-API outage as a workflow/code hang: the CI failure looks like a regression when it is actually an infrastructure issue, which sends the next debugger down the wrong path.
The watchdog cannot currently distinguish "the agent is genuinely stuck" from "the agent's API call is backing off on an overloaded provider" — both present as zero stream output.
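To make the ambiguity concrete, here is a minimal sketch of a quiet-budget watchdog. It is a toy model, not the actual watchdog implementation: the class name `NoProgressWatchdog`, the `run` helper, and the timestamps are all hypothetical, and the 60-second budget simply mirrors the 1m0s seen in the failure message.

```python
from dataclasses import dataclass


@dataclass
class NoProgressWatchdog:
    """Toy model (hypothetical): the watchdog only knows when output last arrived."""
    budget_s: float
    last_output_at: float = 0.0

    def on_stream_output(self, now: float) -> None:
        # Any stream output counts as progress and resets the quiet timer.
        self.last_output_at = now

    def is_stalled(self, now: float) -> bool:
        # The only input is elapsed silence; the cause of the silence is invisible here.
        return now - self.last_output_at > self.budget_s


def run(output_times: list[float], end: float, budget_s: float = 60.0) -> bool:
    """Replay stream-output timestamps (seconds); return True if the watchdog would fire."""
    dog = NoProgressWatchdog(budget_s)
    for t in sorted(output_times):
        if dog.is_stalled(t):
            return True
        dog.on_stream_output(t)
    return dog.is_stalled(end)


# Scenario A: the agent genuinely hangs after its output at t=10s.
hung = run([0, 5, 10], end=120)

# Scenario B: the API call backs off silently from t=10s and would resume at t=95s.
backing_off = run([0, 5, 10, 95, 100], end=120)

print(hung, backing_off)
```

Both scenarios trip the watchdog, because elapsed silence is the only thing it measures. Telling them apart needs an extra signal, such as the retry and error markers in the captured stream.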
Evidence (observed)
- A live lane failed with:
made no stream progress within 1m0s (no-progress quiet budget) — a hung stage; killed the subprocess.
- The captured stream from that run contained a 529 (overloaded) + two 500s + ~19 retries.
- A parallel live lane on a different provider's API passed clean on the same code/commit.
- An earlier run on the same code and the same watchdog budget passed (594s) when the provider API was not overloaded — i.e. the budget is not the root cause; the overload is.
Impact
- Live CI is flaky under provider-API overload, and the failure message is misleading (reports a hang/regression, not "provider overloaded").
- Debugging time is wasted chasing a nonexistent code regression, which erodes trust in the live gate.
Ask
Make a provider-API-caused failure obvious and distinct from a genuine hang/regression. Candidate approaches (not prescriptive):
- Detect provider-overload signals (529 / 5xx / rate-limit / retry markers) in the captured stream and label the failure explicitly, e.g.
LIVE LANE FAILED: provider API overload (transient) — N retries, 529 observed; not a workflow hang.
- While a retry/backoff is in flight, pause or extend the no-progress watchdog (so an overload backoff does not read as a hang).
- Classify an overload-attributed failure as a re-runnable infra-flake, distinct from a test/assertion failure, so CI/operators treat it accordingly.
At minimum, the first approach (surfacing the provider-overload attribution in the failure output) would remove the misdiagnosis.
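The first and third approaches could be sketched roughly as follows. This is a hedged illustration, not a proposed implementation: the captured-stream format is not specified in this report, so the regexes, the `Verdict` type, the `classify_stall` function, and the sample lines are all assumptions to be adapted to what the host SDK actually emits.

```python
import re
from dataclasses import dataclass

# Hypothetical signal patterns; the real stream format may differ.
STATUS_RE = re.compile(r"\bstatus\W{0,2}(429|5\d\d)\b", re.IGNORECASE)
RETRY_RE = re.compile(r"\bretry(?:ing)?\b", re.IGNORECASE)


@dataclass
class Verdict:
    kind: str      # "infra-flake" (re-runnable) or "hang"
    message: str   # text to surface in the CI failure output


def classify_stall(captured_stream: str, budget: str) -> Verdict:
    """After a watchdog kill, attribute it to provider overload if the stream shows it."""
    statuses = sorted(set(STATUS_RE.findall(captured_stream)))
    retries = len(RETRY_RE.findall(captured_stream))
    if statuses or retries:
        seen = ", ".join(statuses) if statuses else "no error status"
        return Verdict(
            kind="infra-flake",
            message=(
                "LIVE LANE FAILED: provider API overload (transient); "
                f"{retries} retries, {seen} observed; not a workflow hang."
            ),
        )
    return Verdict(
        kind="hang",
        message=(
            f"made no stream progress within {budget}; "
            "no provider-overload signals in the captured stream."
        ),
    )


# Made-up sample lines, for illustration only.
sample = "\n".join([
    "stage 2 started",
    "api_error status=529 overloaded; retrying in 8s",
    "api_error status=500; retrying in 16s",
])
verdict = classify_stall(sample, "1m0s")
print(verdict.kind)
print(verdict.message)
```

A CI wrapper could then map `kind == "infra-flake"` to a distinct exit code or an automatic re-run, while `kind == "hang"` keeps failing the lane as it does today. Extending or pausing the watchdog during a backoff (the second approach) would need the same signals while the run is in flight rather than after the kill.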
Context
- Scale: multi-stage live workflow scenarios, run as separate per-host CI lanes (2–3 host lanes), each driving a real headless agent end-to-end.
- The no-progress watchdog is the per-stage liveness guard for these lanes.
- Spacedock version: 0.19.x (workflow commissioned by an earlier version, 0.13.0-dev).