MicroShift CI Doctor: Deterministic prepare #81454
The doctor step previously ran artifact downloads, PCP graph generation, and evidence extraction inside the doctor Claude session, burning its 45-minute timeout and turn budget while the model waited, and booted an entire 10-minute Claude session (doctor-refresh) to run one deterministic script plus a JSON check.

Now the bash step owns the deterministic pipeline:

- prepare/graphs/evidence/fetch-previous run before the doctor session; the session is invoked with --prepared and spends its (reduced, 40m) budget purely on root cause analysis
- finalize (aggregation, cross-run history, HTML generation) runs right after the session, so the report no longer depends on the model ending its session gracefully
- the doctor-refresh session is replaced by a direct doctor.sh refresh call with the --ignore keys derived from closed-bugs.json

Step timeout goes to 1h45m: the preparation time that was previously hidden inside the doctor session budget is now additive, partially offset by the removed refresh session.
Walkthrough

The microshift-ci doctor step script was restructured to separate deterministic phases (prepare, graphs, evidence, finalize, refresh) from Claude-based analysis. The Claude doctor-refresh log tracking was removed, refresh now runs deterministically with an ignore list, and the step reference's timeout and documentation were updated accordingly.

Changes

Doctor Workflow Determinism
Estimated code review effort: 3 (Moderate) | ~20 minutes

Sequence Diagram(s)

sequenceDiagram
participant Script as doctor-commands.sh
participant DoctorSh as doctor.sh
participant Claude as Claude plugin
participant Bugs as closed-bugs.json
Script->>DoctorSh: prepare
Script->>DoctorSh: graphs
Script->>DoctorSh: evidence
Script->>Claude: microshift-ci:doctor --prepared
Claude-->>Script: analysis result
Script->>DoctorSh: finalize
Script->>Bugs: read closed entries
Bugs-->>Script: closed bug keys
Script->>DoctorSh: refresh --ignore <keys>
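
The ordering in the diagram can be sketched as a small shell driver. This is only an illustration under stated assumptions, not the actual doctor-commands.sh: `record`, `run_doctor_session`, `run_pipeline`, and `SESSION_RC` are hypothetical names, and the phases are recorded in a log rather than run through doctor.sh or the Claude plugin. The point it shows is that finalize and refresh still run when the Claude session exits non-zero.

```shell
#!/bin/sh
set -eu

# Hypothetical log of executed phases; the real step would call doctor.sh instead.
PHASE_LOG=""

record() {
    # Append one phase name to the log, separated by " > ".
    PHASE_LOG="${PHASE_LOG}${PHASE_LOG:+ > }$1"
}

run_doctor_session() {
    # Stand-in for the Claude doctor session invoked with --prepared.
    # SESSION_RC is an assumed knob that simulates a failed or timed-out session.
    record "doctor --prepared"
    return "${SESSION_RC:-0}"
}

run_pipeline() {
    ignore_keys="$1"

    # Deterministic phases run before the Claude session starts.
    for phase in prepare graphs evidence fetch-previous; do
        record "${phase}"
    done

    # A failing session must not prevent the report from being produced.
    run_doctor_session || echo "doctor session exited non-zero; finalizing anyway" >&2

    # Deterministic phases that follow the session.
    record "finalize"
    record "refresh --ignore ${ignore_keys}"
}

run_pipeline "KEY-1,KEY-2"
echo "${PHASE_LOG}"
```

Guarding the session call with `||` is one way to keep `set -e` from aborting the step before finalize; the real script may handle this differently.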
Pre-merge checks: 15 passed
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: pmtk
[REHEARSALNOTIFIER]

Prior to this PR being merged, you will need to either run and acknowledge or opt to skip these rehearsals.
🧹 Nitpick comments (1)
ci-operator/step-registry/openshift/edge-tooling/microshift-ci/doctor/openshift-edge-tooling-microshift-ci-doctor-ref.yaml (1)
45-51: 🧹 Nitpick | 🔵 Trivial

Documentation accurately reflects the new deterministic/Claude split.
Confirm the new `1h45m0s` budget comfortably covers `prepare` + `graphs` + `evidence` + the 40-minute Claude `doctor` session + `finalize` + `refresh`, plus the other Claude sessions (`create-bugs`, `close-stale-bugs`, `fix-test-bugs`) referenced by `atexit_handler`, which aren't visible in this diff.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@ci-operator/step-registry/openshift/edge-tooling/microshift-ci/doctor/openshift-edge-tooling-microshift-ci-doctor-ref.yaml` around lines 45 - 51, The doctor step timeout needs to account for all work launched by the pipeline, not just the deterministic phases and the main 40-minute Claude doctor session. Review the timeout and flow in openshift-edge-tooling-microshift-ci-doctor-ref.yaml alongside the atexit_handler-driven sessions (create-bugs, close-stale-bugs, fix-test-bugs) and make the budget explicit by increasing the timeout if needed or documenting the full set of Claude sessions covered by the current 1h45m0s window.
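
The budget check the reviewer asks for comes down to subtracting each session and phase budget from the step timeout. The sketch below uses only the two figures stated in this PR (the 1h45m0s step timeout and the 40-minute doctor session); `remaining_minutes` is a hypothetical helper, and the budgets of the other phases and Claude sessions are not given here, so they would have to be supplied by whoever runs the check.

```shell
#!/bin/sh
set -eu

# Print the minutes left in a step timeout after subtracting every budget given.
# Usage: remaining_minutes TOTAL_MINUTES [BUDGET_MINUTES...]
remaining_minutes() {
    total="$1"
    shift
    for budget in "$@"; do
        total=$((total - budget))
    done
    echo "${total}"
}

STEP_TIMEOUT_MIN=$((60 + 45))   # 1h45m0s step timeout from this PR
DOCTOR_SESSION_MIN=40           # reduced Claude doctor session budget from this PR

# Minutes left for prepare, graphs, evidence, finalize, refresh,
# and any other Claude sessions launched by the step.
remaining_minutes "${STEP_TIMEOUT_MIN}" "${DOCTOR_SESSION_MIN}"
```

With only the doctor session subtracted, 65 minutes remain for everything else. A negative result after adding the remaining budgets would mean the timeout is too small.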
📒 Files selected for processing (2)
ci-operator/step-registry/openshift/edge-tooling/microshift-ci/doctor/openshift-edge-tooling-microshift-ci-doctor-commands.sh
ci-operator/step-registry/openshift/edge-tooling/microshift-ci/doctor/openshift-edge-tooling-microshift-ci-doctor-ref.yaml
@pmtk: all tests passed!
Summary by CodeRabbit
This update makes the MicroShift CI Doctor workflow more deterministic in OpenShift CI. The pre-processing and report-generation phases are now run as scripted, non-Claude steps, while Claude is reserved for the interactive analysis and bug-triage portions of the job. That reduces reliance on session-specific refresh logic, simplifies log handling, and makes the HTML/report refresh step reproducible by deriving ignore rules from closed stale-bug results.
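
As a rough illustration of deriving the ignore list, the sketch below turns a list of closed bug keys into arguments for the refresh call. It rests on assumptions: the keys are taken to be already extracted from closed-bugs.json as one key per line (the file's schema is not shown in this PR), `--ignore` is assumed to accept a single comma-separated value, and `build_refresh_args` and the `KEY-n` values are hypothetical.

```shell
#!/bin/sh
set -eu

# Read newline-separated bug keys on stdin and print the refresh arguments.
# Assumption: --ignore takes one comma-separated value; the real flag format may differ.
build_refresh_args() {
    # Drop blank lines, deduplicate, and join the remaining keys with commas.
    keys="$(grep -v '^[[:space:]]*$' | sort -u | paste -sd, -)"

    if [ -n "${keys}" ]; then
        echo "refresh --ignore ${keys}"
    else
        # No closed bugs: call refresh without an ignore list.
        echo "refresh"
    fi
}

printf 'KEY-2\nKEY-1\n\nKEY-2\n' | build_refresh_args
```

Handling the empty case separately avoids passing `--ignore` with an empty value when no bugs were closed.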
The step definition was also adjusted to reflect the new execution model and given more runtime headroom by increasing the job timeout.