Live CI failure misattributed as a hang when caused by provider API overload (529/5xx/retry stall trips the no-progress watchdog)

## Problem

The runtime live-scenario CI lanes launch a real headless host agent that drives a multi-stage workflow, guarded by a **no-progress stall-watchdog** (it kills the run if the host emits no stream output for a fixed budget, reporting "made no stream progress within {budget} — a hung stage"). 

When the **provider API is overloaded** (HTTP 529 "overloaded", 5xx, rate-limiting), the host SDK backs off and retries — and emits **no stream output during the backoff**. A backoff that exceeds the watchdog budget trips the watchdog, which kills the run and labels it a **hung stage**. This **misattributes a transient provider-API outage as a workflow/code hang**: the CI failure looks like a regression when it is an infrastructure issue, sending the next debugger down the wrong path.

The watchdog cannot currently distinguish "the agent is genuinely stuck" from "the agent's API call is backing off on an overloaded provider" — both present as zero stream output.

## Evidence (observed)

- A live lane failed with `made no stream progress within 1m0s (no-progress quiet budget) — a hung stage; killed the subprocess`.
- The captured stream from that run contained a **529 (overloaded) + two 500s + ~19 retries**.
- A parallel live lane on a **different provider's API** passed clean on the **same code/commit**.
- An **earlier run on the same code and the same watchdog budget passed** (594s) when the provider API was not overloaded — i.e. the budget is not the root cause; the overload is.

## Impact

- Live CI is flaky under provider-API overload, and the failure message is **misleading** (reports a hang/regression, not "provider overloaded").
- Wasted debugging chasing a non-existent code regression; erodes trust in the live gate.

## Ask

Make a provider-API-caused failure **obvious and distinct from a genuine hang/regression**. Candidate approaches (not prescriptive):
1. Detect provider-overload signals (529 / 5xx / rate-limit / retry markers) in the captured stream and **label the failure** explicitly, e.g. `LIVE LANE FAILED: provider API overload (transient) — N retries, 529 observed; not a workflow hang`.
2. While a retry/backoff is in flight, **pause or extend** the no-progress watchdog (so an overload backoff does not read as a hang).
3. **Classify** an overload-attributed failure as a re-runnable infra-flake, distinct from a test/assertion failure, so CI/operators treat it accordingly.

At minimum (1) — surfacing the provider-overload attribution in the failure output — would remove the misdiagnosis.

## Context

- Scale: multi-stage live workflow scenarios, run as separate per-host CI lanes (2–3 host lanes), each driving a real headless agent end-to-end.
- The no-progress watchdog is the per-stage liveness guard for these lanes.
- Spacedock version: 0.19.x (workflow commissioned by an earlier 0.13.0-dev).


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Live CI failure misattributed as a hang when caused by provider API overload (529/5xx/retry stall trips the no-progress watchdog) #307

Problem

Evidence (observed)

Impact

Ask

Context

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Live CI failure misattributed as a hang when caused by provider API overload (529/5xx/retry stall trips the no-progress watchdog) #307

Description

Problem

Evidence (observed)

Impact

Ask

Context

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions