cbusillo · May 1, 2026 · Apr 26, 2026
diff --git a/.codex/skills/babysit-pr/SKILL.md b/.codex/skills/babysit-pr/SKILL.md
@@ -27,10 +27,10 @@ Accept any of the following:
 2. Run the watcher script to snapshot PR/review/CI state (or consume each streamed snapshot from `--watch`).
 3. Inspect the `actions` list in the JSON response.
 4. If `diagnose_ci_failure` is present, inspect failed run logs and classify the failure.
-5. If the failure is likely caused by the current branch, patch code locally, commit, and push.
+5. If the failure is likely caused by the current branch, patch code locally, commit, and push. Do not patch unrelated flaky tests, CI infrastructure, dependency outages, runner issues, or other failures that are not caused by the branch.
 6. If `process_review_comment` is present, inspect surfaced review items and decide whether to address them.
 7. If a review item is actionable and correct, patch code locally, commit, push, and then mark the associated review thread/comment as resolved once the fix is on GitHub.
-8. If a review item from another author is non-actionable, already addressed, or not valid, post one reply on the comment/thread explaining that decision (for example answering the question or explaining why no change is needed). Prefix the GitHub reply body with `[codex]` so it is clear the response is automated. If the watcher later surfaces your own reply, treat that self-authored item as already handled and do not reply again.
+8. Do not post replies to human-authored review comments/threads unless the user explicitly confirms the exact response. If a human review item is non-actionable, already addressed, or not valid, surface the item and recommended response to the user instead of replying on GitHub.
 9. If the failure is likely flaky/unrelated and `retry_failed_checks` is present, rerun failed jobs with `--retry-failed-now`.
 10. If both actionable review feedback and `retry_failed_checks` are present, prioritize review feedback first; a new commit will retrigger CI, so avoid rerunning flaky checks on the old SHA unless you intentionally defer the review change.
 11. On every loop, look for newly surfaced review feedback before acting on CI failures or mergeability state, then verify mergeability / merge-conflict status (for example via `gh pr view`) alongside CI.
@@ -69,12 +69,18 @@ python3 .codex/skills/babysit-pr/scripts/gh_pr_watch.py --pr <number-or-url> --o
 Use `gh` commands to inspect failed runs before deciding to rerun.
 
 - `gh run view <run-id> --json jobs,name,workflowName,conclusion,status,url,headSha`
-- `gh run view <run-id> --log-failed`
+- `gh api repos/<owner>/<repo>/actions/runs/<run-id>/jobs -X GET -f per_page=100`
+- `gh api repos/<owner>/<repo>/actions/jobs/<job-id>/logs > /tmp/codex-gh-job-<job-id>-logs.txt`
+- `gh run view <run-id> --log-failed` as a fallback after the overall workflow run is complete
 
-Prefer treating failures as branch-related when logs point to changed code (compile/test/lint/typecheck/snapshots/static analysis in touched areas).
+`gh run view --log-failed` is workflow-run scoped and may not expose failed-job logs until the overall run finishes. For faster diagnosis, poll the run's jobs first and, as soon as a specific job has failed, fetch that job's logs directly from the Actions job logs endpoint. The watcher includes a `failed_jobs` list with each failed job's `job_id` and `logs_endpoint` when GitHub exposes one.
+
+Prefer treating failures as branch-related when failed-job logs point to changed code (compile/test/lint/typecheck/snapshots/static analysis in touched areas).
 
 Prefer treating failures as flaky/unrelated when logs show transient infra/external issues (timeouts, runner provisioning failures, registry/network outages, GitHub Actions infra errors).
 
+Do not attempt to fix flaky/unrelated failures by changing tests, build scripts, CI configuration, dependency pins, or infrastructure-adjacent code unless the logs clearly connect the failure to the PR branch. For flaky/unrelated failures, rerun only when the watcher recommends `retry_failed_checks`; otherwise wait or stop for user help.
+
 If classification is ambiguous, perform one manual diagnosis attempt before choosing rerun.
 
 Read `.codex/skills/babysit-pr/references/heuristics.md` for a concise checklist.
@@ -99,7 +105,8 @@ When you agree with a comment and it is actionable:
 5. Resume watching on the new SHA immediately (do not stop after reporting the push).
 6. If monitoring was running in `--watch` mode, restart `--watch` immediately after the push in the same turn; do not wait for the user to ask again.
 
-If you disagree or the comment is non-actionable/already addressed, reply once directly on the GitHub comment/thread so the reviewer gets an explicit answer, then continue the watcher loop. Prefix any GitHub reply to a code review comment/thread with `[codex]` so it is clear the response is automated and not from the human user. If the watcher later surfaces your own reply because the authenticated operator is treated as a trusted review author, treat that self-authored item as already handled and do not reply again.
+Do not post replies to human-authored GitHub review comments/threads automatically. If you disagree with a human comment, believe it is non-actionable/already addressed, or need to answer a question, report the item to the user with a suggested response and wait for explicit confirmation before posting anything on GitHub. If the user approves a response, prefix it with `[codex]` so it is clear the response is automated and not from the human user.
+If the watcher later surfaces your own approved reply because the authenticated operator is treated as a trusted review author, treat that self-authored item as already handled and do not reply again.
 If a code review comment/thread is already marked as resolved in GitHub, treat it as non-actionable and safely ignore it unless new unresolved follow-up feedback appears.
 
 ## Git Safety Rules
@@ -125,11 +132,11 @@ Use this loop in a live Codex session:
 2. Read `actions`.
 3. First check whether the PR is now merged or otherwise closed; if so, report that terminal state and stop polling immediately.
 4. Check CI summary, new review items, and mergeability/conflict status.
-5. Diagnose CI failures and classify branch-related vs flaky/unrelated.
-6. For each surfaced review item from another author, either reply once with an explanation if it is non-actionable or patch/commit/push and then resolve it if it is actionable. If a later snapshot surfaces your own reply, treat it as informational and continue without responding again.
+5. Diagnose CI failures and classify branch-related vs flaky/unrelated. If the overall run is still pending but `failed_jobs` already includes a failed job, fetch that job's logs and diagnose immediately instead of waiting for the whole workflow run to finish. Patch only when the failure is branch-related.
+6. For each surfaced review item from another author: if it is actionable, patch/commit/push and then resolve it. If it is non-actionable, already addressed, or requires a written answer, surface it to the user with a suggested response instead of posting automatically. If a later snapshot surfaces your own approved reply, treat it as informational and continue without responding again.
 7. Process actionable review comments before flaky reruns when both are present; if a review fix requires a commit, push it and skip rerunning failed checks on the old SHA.
-8. Retry failed checks only when `retry_failed_checks` is present and you are not about to replace the current SHA with a review/CI fix commit.
-9. If you pushed a commit, resolved a review thread, replied to a review comment, or triggered a rerun, report the action briefly and continue polling (do not stop).
+8. Retry failed checks only when `retry_failed_checks` is present and you are not about to replace the current SHA with a review/CI fix commit. Do not make code changes for unrelated flakes or infrastructure failures just to get CI green.
+9. If you pushed a commit, resolved a review thread, or triggered a rerun, report the action briefly and continue polling (do not stop). If a human review comment needs a written GitHub response, stop and ask for confirmation before posting.
 10. After a review-fix push, proactively restart continuous monitoring (`--watch`) in the same turn unless a strict stop condition has already been reached.
 11. If everything is passing, mergeable, not blocked on required review approval, and there are no unaddressed review items, report that the PR is currently ready to merge but keep the watcher running so new review comments are surfaced quickly while the PR remains open.
 12. If blocked on a user-help-required issue (infra outage, exhausted flaky retries, unclear reviewer request, permissions), report the blocker and stop.
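
The early-diagnosis step in the loop above can be sketched as a small helper that turns the watcher's `failed_jobs` entries into `gh api` invocations. This is a minimal sketch, not part of the skill: the function name `log_fetch_commands`, the sample snapshot, and the owner/repo values are hypothetical, and it relies only on the `failed_jobs`, `job_id`, and `logs_endpoint` fields described in this change.

```python
def log_fetch_commands(snapshot):
    """Build one `gh api <logs_endpoint>` argv per failed job in a watcher snapshot.

    Hypothetical helper for illustration; only the `failed_jobs`, `job_id`,
    and `logs_endpoint` keys come from the watcher output described above.
    """
    commands = []
    for job in snapshot.get("failed_jobs") or []:
        endpoint = job.get("logs_endpoint")
        if not endpoint:
            # No job id was exposed, so there is no logs endpoint to fetch yet.
            continue
        commands.append(["gh", "api", endpoint])
    return commands


# Made-up snapshot: the run is still in progress, but one job has already failed.
sample_snapshot = {
    "failed_jobs": [
        {
            "run_id": 111,
            "run_status": "in_progress",
            "job_id": 222,
            "job_name": "unit-tests",
            "conclusion": "failure",
            "logs_endpoint": "repos/example-owner/example-repo/actions/jobs/222/logs",
        },
        # Entry without a usable job id: skipped by the helper.
        {"run_id": 111, "job_id": None, "logs_endpoint": None},
    ]
}

for argv in log_fetch_commands(sample_snapshot):
    print(" ".join(argv))
```

Returning argv lists rather than shell strings keeps the sketch safe to hand to `subprocess.run` without quoting concerns; whether the real agent shells out this way is an assumption.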

diff --git a/.codex/skills/babysit-pr/agents/openai.yaml b/.codex/skills/babysit-pr/agents/openai.yaml
@@ -1,4 +1,4 @@
 interface:
   display_name: "PR Babysitter"
   short_description: "Watch PR review comments, CI, and merge conflicts"
-  default_prompt: "Babysit the current PR: monitor reviewer comments, CI, and merge-conflict status (prefer the watcher’s --watch mode for live monitoring); surface new review feedback before acting on CI or mergeability work, fix valid issues, push updates, and rerun flaky failures up to 3 times. Keep exactly one watcher session active for the PR (do not leave duplicate --watch terminals running). If you pause monitoring to patch review/CI feedback, restart --watch yourself immediately after the push in the same turn. If a watcher is still running and no strict stop condition has been reached, the task is still in progress: keep consuming watcher output and sending progress updates instead of ending the turn. Do not treat a green + mergeable PR as a terminal stop while it is still open; continue polling autonomously after any push/rerun so newly posted review comments are surfaced until a strict terminal stop condition is reached or the user interrupts."
+  default_prompt: "Babysit the current PR: monitor reviewer comments, CI, and merge-conflict status (prefer the watcher’s --watch mode for live monitoring); surface new review feedback before acting on CI or mergeability work, fix valid issues, push updates, and rerun flaky failures up to 3 times. Do not post replies to human-authored review comments unless the user explicitly confirms the exact response. Do not patch unrelated flaky tests, CI infrastructure, dependency outages, runner issues, or other failures that are not caused by the branch. Keep exactly one watcher session active for the PR (do not leave duplicate --watch terminals running). If you pause monitoring to patch review/CI feedback, restart --watch yourself immediately after the push in the same turn. If a watcher is still running and no strict stop condition has been reached, the task is still in progress: keep consuming watcher output and sending progress updates instead of ending the turn. Do not treat a green + mergeable PR as a terminal stop while it is still open; continue polling autonomously after any push/rerun so newly posted review comments are surfaced until a strict terminal stop condition is reached or the user interrupts."
diff --git a/.codex/skills/babysit-pr/references/github-api-notes.md b/.codex/skills/babysit-pr/references/github-api-notes.md
@@ -23,9 +23,11 @@ Used to discover failed workflow runs and rerunnable run IDs.
 ### Failed log inspection
 
 - `gh run view <run-id> --json jobs,name,workflowName,conclusion,status,url,headSha`
+- `gh api repos/{owner}/{repo}/actions/runs/{run_id}/jobs -X GET -f per_page=100`
+- `gh api repos/{owner}/{repo}/actions/jobs/{job_id}/logs > /tmp/codex-gh-job-{job_id}-logs.txt`
 - `gh run view <run-id> --log-failed`
 
-Used by Codex to classify branch-related vs flaky/unrelated failures.
+Used by Codex to classify branch-related vs flaky/unrelated failures. Prefer the direct job log endpoint as soon as a job has failed because `gh run view --log-failed` may not produce failed-job logs until the overall workflow run completes.
 
 ### Retry failed jobs only
 
@@ -70,3 +72,11 @@ Reruns only failed jobs (and dependencies) for a workflow run.
 - `conclusion`
 - `html_url`
 - `head_sha`
+
+### Actions run jobs API (`jobs[]`)
+
+- `id`
+- `name`
+- `status`
+- `conclusion`
+- `html_url`
diff --git a/.codex/skills/babysit-pr/references/heuristics.md b/.codex/skills/babysit-pr/references/heuristics.md
@@ -18,16 +18,20 @@ Treat as **likely flaky or unrelated** when evidence points to transient or exte
 - Cloud/service rate limits or transient API outages
 - Non-deterministic failures in unrelated integration tests with known flake patterns
 
+Do not patch likely flaky/unrelated failures. Use the retry budget for rerunnable failures, wait for pending jobs, or stop and report the blocker when the failure is persistent or infrastructure-owned.
+
 If uncertain, inspect failed logs once before choosing rerun.
 
 ## Decision tree (fix vs rerun vs stop)
 
 1. If PR is merged/closed: stop.
 2. If there are failed checks:
    - Diagnose first.
+   - If checks are still pending but an individual job has already failed: fetch that job's logs and diagnose now.
    - If branch-related: fix locally, commit, push.
    - If likely flaky/unrelated and all checks for the current SHA are terminal: rerun failed jobs.
-   - If checks are still pending: wait.
+   - If likely flaky/unrelated and not safely rerunnable: stop and report the blocker; do not edit unrelated tests, build scripts, CI configuration, dependency pins, or infrastructure code.
+   - If checks are still pending and no failed job is available yet: wait.
 3. If flaky reruns for the same SHA reach the configured limit (default 3): stop and report persistent failure.
 4. Independently, process any new human review comments.
 
@@ -40,12 +44,15 @@ Address the comment when:
 - The requested change does not conflict with the user’s intent or recent guidance.
 - The change can be made safely without unrelated refactors.
 
+Fix valid human review feedback in code when possible, but do not post a GitHub reply to a human-authored comment/thread unless the user explicitly confirms the exact response.
+
 Do not auto-fix when:
 
 - The comment is ambiguous and needs clarification.
 - The request conflicts with explicit user instructions.
 - The proposed change requires product/design decisions the user has not made.
 - The codebase is in a dirty/unrelated state that makes safe editing uncertain.
+- The comment only needs a written answer or disagreement response; propose the reply to the user instead of posting it automatically.
 
 ## Stop-and-ask conditions
 
@@ -56,3 +63,4 @@ Stop and ask the user instead of continuing automatically when:
 - The PR branch cannot be pushed.
 - CI failures persist after the flaky retry budget.
 - Reviewer feedback requires a product decision or cross-team coordination.
+- A human review comment requires a written GitHub reply instead of a code change.
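
The CI half of the updated decision tree can be read as a pure function. The following is a rough sketch under stated assumptions, not code from the skill: the function name, its parameters, and the returned labels are all hypothetical, and treating "likely flaky while checks are still pending" as "wait" is one interpretation of the checklist rather than something it states outright.

```python
def next_ci_step(pr_open, has_failure, classification, all_terminal,
                 rerunnable, retries_used, max_retries=3):
    """Return a label for the next CI action (hypothetical labels).

    classification: None (not diagnosed yet), "branch", or "flaky".
    """
    if not pr_open:
        return "stop"  # merged/closed PR: nothing left to babysit
    if not has_failure:
        # No failed check or job yet: wait while pending, otherwise idle.
        return "wait" if not all_terminal else "nothing_to_do"
    if classification is None:
        # A failed job is enough to diagnose, even if the run is still pending.
        return "diagnose"
    if classification == "branch":
        return "fix_commit_push"
    # Likely flaky/unrelated from here on: never patch, only rerun/wait/stop.
    if not all_terminal:
        return "wait"  # assumption: reruns need all checks on the SHA terminal
    if retries_used >= max_retries:
        return "stop_and_report"  # flaky retry budget exhausted
    if rerunnable:
        return "rerun_failed_jobs"
    return "stop_and_report"  # not safely rerunnable: report the blocker


# A job failed early in a pending run and has not been classified yet.
print(next_ci_step(True, True, None, False, False, 0))
# Flaky failure, everything terminal, one retry already used.
print(next_ci_step(True, True, "flaky", True, True, 1))
```

Review comments are deliberately left out of the sketch because the checklist processes them independently of CI state.
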
diff --git a/.codex/skills/babysit-pr/scripts/gh_pr_watch.py b/.codex/skills/babysit-pr/scripts/gh_pr_watch.py
@@ -338,6 +338,66 @@ def failed_runs_from_workflow_runs(runs, head_sha):
     return failed_runs
 
 
+def get_jobs_for_run(repo, run_id):
+    endpoint = f"repos/{repo}/actions/runs/{run_id}/jobs"
+    data = gh_json(["api", endpoint, "-X", "GET", "-f", "per_page=100"], repo=repo)
+    if not isinstance(data, dict):
+        raise GhCommandError("Unexpected payload from actions run jobs API")
+    jobs = data.get("jobs") or []
+    if not isinstance(jobs, list):
+        raise GhCommandError("Expected `jobs` to be a list")
+    return jobs
+
+
+def failed_jobs_from_workflow_runs(repo, runs, head_sha):
+    failed_jobs = []
+    for run in runs:
+        if not isinstance(run, dict):
+            continue
+        if str(run.get("head_sha") or "") != head_sha:
+            continue
+        run_id = run.get("id")
+        if run_id in (None, ""):
+            continue
+        run_status = str(run.get("status") or "")
+        run_conclusion = str(run.get("conclusion") or "")
+        if run_status.lower() == "completed" and run_conclusion not in FAILED_RUN_CONCLUSIONS:
+            continue
+        jobs = get_jobs_for_run(repo, run_id)
+        for job in jobs:
+            if not isinstance(job, dict):
+                continue
+            conclusion = str(job.get("conclusion") or "")
+            if conclusion not in FAILED_RUN_CONCLUSIONS:
+                continue
+            job_id = job.get("id")
+            logs_endpoint = None
+            if job_id not in (None, ""):
+                logs_endpoint = f"repos/{repo}/actions/jobs/{job_id}/logs"
+            failed_jobs.append(
+                {
+                    "run_id": run_id,
+                    "workflow_name": run.get("name") or run.get("display_title") or "",
+                    "run_status": run_status,
+                    "run_conclusion": run_conclusion,
+                    "job_id": job_id,
+                    "job_name": str(job.get("name") or ""),
+                    "status": str(job.get("status") or ""),
+                    "conclusion": conclusion,
+                    "html_url": str(job.get("html_url") or ""),
+                    "logs_endpoint": logs_endpoint,
+                }
+            )
+    failed_jobs.sort(
+        key=lambda item: (
+            str(item.get("workflow_name") or ""),
+            str(item.get("job_name") or ""),
+            str(item.get("job_id") or ""),
+        )
+    )
+    return failed_jobs
+
+
 def get_authenticated_login():
     data = gh_json(["api", "user"])
     if not isinstance(data, dict) or not data.get("login"):
@@ -568,7 +628,7 @@ def is_pr_ready_to_merge(pr, checks_summary, new_review_items):
     return True
 
 
-def recommend_actions(pr, checks_summary, failed_runs, new_review_items, retries_used, max_retries):
+def recommend_actions(pr, checks_summary, failed_runs, failed_jobs, new_review_items, retries_used, max_retries):
     actions = []
     if pr["closed"] or pr["merged"]:
         if new_review_items:
@@ -583,7 +643,7 @@ def recommend_actions(pr, checks_summary, failed_runs, new_review_items, retries
     if new_review_items:
         actions.append("process_review_comment")
 
-    has_failed_pr_checks = checks_summary["failed_count"] > 0
+    has_failed_pr_checks = checks_summary["failed_count"] > 0 or bool(failed_jobs)
     if has_failed_pr_checks:
         if checks_summary["all_terminal"] and retries_used >= max_retries:
             actions.append("stop_exhausted_retries")
@@ -621,12 +681,14 @@ def collect_snapshot(args):
     checks_summary = summarize_checks(checks)
     workflow_runs = get_workflow_runs_for_sha(pr["repo"], pr["head_sha"])
     failed_runs = failed_runs_from_workflow_runs(workflow_runs, pr["head_sha"])
+    failed_jobs = failed_jobs_from_workflow_runs(pr["repo"], workflow_runs, pr["head_sha"])
 
     retries_used = current_retry_count(state, pr["head_sha"])
     actions = recommend_actions(
         pr,
         checks_summary,
         failed_runs,
+        failed_jobs,
         new_review_items,
         retries_used,
         args.max_flaky_retries,
@@ -641,6 +703,7 @@ def collect_snapshot(args):
         "pr": pr,
         "checks": checks_summary,
         "failed_runs": failed_runs,
+        "failed_jobs": failed_jobs,
         "new_review_items": new_review_items,
         "actions": actions,
         "retry_state": {
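
The effect of passing `failed_jobs` into `recommend_actions` can be illustrated in isolation. This is a minimal sketch that mirrors only the changed `has_failed_pr_checks` expression; the standalone function and the sample values are hypothetical, and it assumes a snapshot in which the checks summary has not yet counted a failure while a job-level failure is already visible.

```python
def has_failed_pr_checks(checks_summary, failed_jobs):
    """Mirror of the updated condition: a failed job counts as a failed check."""
    return checks_summary["failed_count"] > 0 or bool(failed_jobs)


# Made-up snapshot pieces: the checks summary shows no failure yet,
# but the jobs API already reports one failed job for the head SHA.
pending_summary = {"failed_count": 0, "all_terminal": False}
early_failed_jobs = [{"job_id": 222, "job_name": "unit-tests", "conclusion": "failure"}]

print(has_failed_pr_checks(pending_summary, []))                 # False
print(has_failed_pr_checks(pending_summary, early_failed_jobs))  # True
```

With the previous condition both calls would have been false, so failure handling would not start until the checks summary itself reported a failure.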