Window: 2026-06-09 19:13–21:41Z (~2.5h evening cluster, 38 in-window runs — logs tool timed out at 97s so this is a partial window). Engines: copilot 27, claude 9, codex 2.
🟢 Headline — yesterday's CRITICAL production incident is RESOLVED. The copilot-pat-not-supported-400 incident ("Personal Access Tokens are not supported for this endpoint"), which failed ~15 runs yesterday, is closed. Four runs still carry the signature in their logs, but all four concluded success on attempt 1 — the harness now tolerates/retries past the token-check 400 (fix branch pelikhan/fix-pat-400-retry landed). Zero PAT-400-attributable failures this window.
| Metric | Value |
| --- | --- |
| Success / Failure | 31 / 6 (+1 in-progress self) |
| Success rate | 83.8% (30-day avg 83.6%) |
| Tokens | 23.8M · AIC 5,191.79 |
| Turns / Action-min | 566 / 451 |
| Firewall blocked | 6 / 2,304 = 0.26% (clean; only proxy.golang.org, by-design) |
| token-429 effective-cap | ABSENT ✅ |
Every one of the 6 failures was copilot-engine. Claude ran 9/9 clean and codex 2/2 clean.
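The headline rates can be reproduced from the counts above. The short Python check below assumes the in-progress run is left out of the success-rate denominator; that is an inference from the numbers (31/37 rounds to 83.8%, 31/38 does not), not something the report states.

```python
# Counts taken from the summary metrics above.
successes = 31
failures = 6
in_progress = 1  # the audit run itself, still running

# Assumption: only completed runs count toward the success rate.
completed = successes + failures
success_rate = 100 * successes / completed

firewall_blocked = 6
firewall_total = 2304
firewall_block_rate = 100 * firewall_blocked / firewall_total

print(f"total runs:       {completed + in_progress}")
print(f"success rate:     {success_rate:.1f}%")
print(f"firewall blocked: {firewall_block_rate:.2f}%")
```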
Critical / New This Window
🆕 daily-ai-credits-cap-429 — Daily Ambient Context Optimizer (prod-main, run §27233269431) ran 57 turns / 3.0M tokens, then the Copilot API returned CAPIError 429 Maximum AI credits exceeded (1026.78 / 1000) after 5 retries → failure. This is the account daily-credit cap (~1000), distinct from the 25M effective-token cap (which was absent). The heavy aggregator crosses the cap late in its run, wasting all prior work.
🆕 copilot-harness-15min-action-timeout — Test Quality Sentinel (PR branch, run §27234356749) produced 20 turns / 649k tokens of valid review, then Execute GitHub Copilot CLI timed out after 15 minutes → reddened with no emitted output. Sibling TQS runs on other branches succeeded.
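One way to avoid losing a long run to the daily credit cap is to check the remaining budget before each turn and stop early with partial output. The sketch below is illustrative only: `run_with_credit_guard`, `credits_used`, and the `reserve` margin are hypothetical names, and it assumes some way to read the credits consumed so far exists, which the report does not confirm. The 1000-credit default mirrors the approximate cap quoted above.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class GuardResult:
    outputs: List[str]  # outputs of the turns that actually ran
    degraded: bool      # True if the guard stopped the run early


def run_with_credit_guard(
    turns: Iterable[Callable[[], str]],
    credits_used: Callable[[], float],  # hypothetical: credits consumed so far today
    daily_cap: float = 1000.0,          # approximate cap quoted in the report
    reserve: float = 50.0,              # hypothetical safety margin
) -> GuardResult:
    """Run turns until the remaining credit budget drops below the reserve."""
    outputs: List[str] = []
    for turn in turns:
        if daily_cap - credits_used() < reserve:
            # Stop before the API would answer 429; keep the work done so far.
            return GuardResult(outputs, degraded=True)
        outputs.append(turn())
    return GuardResult(outputs, degraded=False)


# Simulated usage: the account starts the run at 900 credits, each turn costs 30.
usage = {"credits": 900.0}


def make_turn(index: int, cost: float) -> Callable[[], str]:
    def turn() -> str:
        usage["credits"] += cost
        return f"turn {index} done"
    return turn


result = run_with_credit_guard(
    [make_turn(i, 30.0) for i in range(5)],
    credits_used=lambda: usage["credits"],
)
print(result)
```

In this simulation the guard stops after two turns, once fewer than 50 credits remain, and reports the run as degraded, so the caller can publish partial output instead of failing.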
Recurring Issues
4 recurring failures:
cli-proxy-difc-liveness-probe-failed (recur day 2, infra): liveness probe failed for localhost:18443 (gh api exit=0); agent never starts (turns=0). Hit Issue Monster (§27229956123) and PR Sous Chef (§27229990045). Intermittent: other Sous Chef main runs succeeded in the same window.
copilot-sdk-driver-failures / tool-perm-lockout (recur): Daily Safe Output Integrator (§27230190112) hit 11 permission-denied errors, turns=1. This was also the window's single missing_tool (tool/permission).
copilot-sdk-driver-failures / session.idle (recur): PR Code Quality Reviewer (§27233924083), PR branch, turns=1.
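Because the liveness-probe failure is intermittent, a bounded retry with exponential backoff is a natural mitigation. The sketch below is not the harness's actual probe: `probe_with_backoff`, its parameters, and the default retry budget are hypothetical, and the probe itself is passed in as a callable so that no real command or endpoint behaviour is assumed.

```python
import time
from typing import Callable, List


def probe_with_backoff(
    probe: Callable[[], bool],  # hypothetical: returns True when the proxy answers
    attempts: int = 5,          # hypothetical retry budget
    base_delay: float = 0.5,    # seconds to wait before the second attempt
    max_delay: float = 8.0,     # upper bound on any single wait
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True as soon as the probe succeeds, retrying with exponential backoff."""
    delay = base_delay
    for attempt in range(1, attempts + 1):
        if probe():
            return True
        if attempt < attempts:
            sleep(delay)
            delay = min(delay * 2, max_delay)
    return False


# Simulated flaky probe: fails twice, then succeeds. Delays are recorded, not slept.
responses = iter([False, False, True])
delays: List[float] = []
ok = probe_with_backoff(lambda: next(responses), sleep=delays.append)
print(ok, delays)
```

Injecting `sleep` keeps the function testable without real waiting; a caller would declare the proxy dead only when the function returns False.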
📊 Trends
Workflow Health (30 days)
Success rate sits at 83.8%, essentially on the 30-day mean of 83.6% and recovering from yesterday's PAT-incident dip. The failure stack is now small and dominated by copilot infra/timeout classes rather than a single systemic incident, a healthier shape than the past week.
Token Usage (30 days)
Daily tokens (23.8M) are below the 7-day moving average and well under the ~37M 30-day average — partly because this is a short partial window. No effective-token cap pressure this window.
Recommendations
Add a soft AI-credits pre-cap guard to heavy daily aggregators (esp. Daily Ambient Context Optimizer): check remaining daily credits before the heaviest turns and degrade gracefully (partial output + noop) rather than 429-aborting at turn 57 and wasting 3M tokens. Alternatively raise the cap for the few heavy prod-main aggregators or shard their work.
Checkpoint copilot review output / raise the step timeout for review-heavy workflows so a 15-min action timeout still emits the buffered review instead of reddening with nothing (Test Quality Sentinel).
Add retry/backoff to the DIFC localhost:18443 liveness probe before declaring the cli-proxy dead — it is intermittently flaky and zeroes out otherwise-healthy runs.
Close out the now-resolved rec-pat-400-rollback-or-fix and keep pelikhan/fix-pat-400-retry merged on main.
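The checkpointing recommendation for review-heavy workflows can be illustrated with a time-budgeted loop that always emits whatever it has buffered. This is a sketch under stated assumptions: `review_with_deadline`, the `emit` sink, and the 14-minute budget are hypothetical, and the loop checks the clock only between steps, so it cannot interrupt a step that is already running.

```python
import time
from typing import Callable, Iterable, List, Tuple


def review_with_deadline(
    steps: Iterable[Callable[[], str]],
    budget_seconds: float,                    # e.g. somewhat under the step timeout
    emit: Callable[[List[str], bool], None],  # hypothetical sink for review output
    clock: Callable[[], float] = time.monotonic,
) -> Tuple[int, bool]:
    """Run steps until the budget is spent, then emit whatever was buffered."""
    deadline = clock() + budget_seconds
    buffered: List[str] = []
    complete = True
    for step in steps:
        if clock() >= deadline:
            complete = False  # out of time: stop and emit a partial review
            break
        buffered.append(step())
    emit(buffered, complete)
    return len(buffered), complete


# Simulated run: each review step takes 300 s on a fake clock, budget is 14 minutes.
now = {"t": 0.0}


def fake_step(index: int) -> Callable[[], str]:
    def step() -> str:
        now["t"] += 300.0
        return f"finding {index}"
    return step


emitted: List[Tuple[List[str], bool]] = []
count, complete = review_with_deadline(
    [fake_step(i) for i in range(5)],
    budget_seconds=840.0,
    emit=lambda out, done: emitted.append((list(out), done)),
    clock=lambda: now["t"],
)
print(count, complete, emitted)
```

In the simulation three of five steps finish before the budget runs out, and the partial review is still emitted with `complete` set to False, so the run would carry output rather than ending with nothing.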
Notes
All 6 failures were copilot-engine; claude and codex were 100% clean. The window is partial (logs tool timeout), so absolute counts undercount the full 24h. Repo memory updated: PAT-400 marked RESOLVED, two new issue classes recorded, recurrence counters bumped for the cli-proxy and sdk-driver families.
References: §27233269431 · §27234356749 · §27230190112