fix(handlers): stop flaky coverage CI from global-logger race in parallel tests#603

Closed
AbirAbbas wants to merge 1 commit into main from fix/flaky-discovery-logging-global-logger-race

@AbirAbbas
Contributor

Summary

The coverage (control-plane) CI job (and the cascading coverage-summary job) flake because two tests mutate the process-global logger.Logger while running with t.Parallel():

  • TestDiscoveryLoggingIncludesOptionalRequestID
  • TestExecutionCleanupService_StartStopBranches

Both call setupExecutionCleanupTestLogger, which swaps logger.Logger and restores it via t.Cleanup. Run in parallel, a sibling test reassigns the global logger mid-run, so the discovery test's captured buffer comes back empty and its assertions fail:

coverage_handlers_92_target_test.go:380: "" does not contain "\"request_id\":\"req-123\""

coverage-summary then fails with missing input: control-plane.total.txt because the control-plane coverage step exited non-zero. The non-test build is unaffected, which is why this slipped past the required gates and only surfaced in the coverage job's full parallel run.
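To make the interleaving concrete, here is a minimal sketch that replays it sequentially so the outcome is deterministic. It is not the project's code: `Logger` stands in for the process-global logger.Logger using the standard library's `log.Logger`, and `swapLogger` is a hypothetical stand-in for setupExecutionCleanupTestLogger.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"log"
)

// Logger stands in for the process-global logger.Logger (assumed shape).
var Logger = log.New(io.Discard, "", 0)

// swapLogger is a hypothetical stand-in for setupExecutionCleanupTestLogger:
// it points the global at a fresh buffer and returns the buffer plus a
// restore function (the role t.Cleanup plays in the real helper).
func swapLogger() (*bytes.Buffer, func()) {
	buf := &bytes.Buffer{}
	prev := Logger
	Logger = log.New(buf, "", 0)
	return buf, func() { Logger = prev }
}

// interleave replays the problematic ordering step by step: test A installs
// its logger, sibling test B replaces it, and only then does A's code log.
func interleave() (capturedByA, capturedByB string) {
	bufA, restoreA := swapLogger() // test A sets up its capture buffer
	bufB, restoreB := swapLogger() // sibling B reassigns the global mid-run

	// Code under test for A logs through the global, which now belongs to B.
	Logger.Print(`{"request_id":"req-123"}`)

	restoreB()
	restoreA()
	return bufA.String(), bufB.String()
}

func main() {
	a, b := interleave()
	fmt.Printf("A captured %q\n", a)
	fmt.Printf("B captured %q\n", b)
}
```

In this ordering A's buffer stays empty and the line lands in B's buffer, which is the same shape as the `"" does not contain ...` assertion failure above.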

Fix

Drop t.Parallel() from both tests so they run in the serial phase, where nothing else mutates the global logger concurrently. This matches the existing convention — the logger-swapping tests in execution_cleanup_test.go are already non-parallel for the same reason.
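As a rough model of why serial execution is enough, the sketch below runs two logger-swapping bodies one after the other, so each swap is restored before the next begins. The names are illustrative assumptions (`withTestLogger` is hypothetical, `Logger` is a standard-library `log.Logger` standing in for logger.Logger, and the second log message is invented); in the real test file the change is only the removal of the `t.Parallel()` call.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"log"
)

// Logger stands in for the process-global logger.Logger (assumed shape).
var Logger = log.New(io.Discard, "", 0)

// withTestLogger (hypothetical) swaps the global logger for the duration of
// body and restores it afterwards, the way a t.Cleanup-based helper would.
func withTestLogger(body func()) string {
	buf := &bytes.Buffer{}
	prev := Logger
	Logger = log.New(buf, "", 0)
	defer func() { Logger = prev }()
	body()
	return buf.String()
}

// runSerially models the serial phase: the second body starts only after the
// first has finished and restored the global, so the captures cannot mix.
func runSerially() (discovery, cleanup string) {
	discovery = withTestLogger(func() {
		Logger.Print("discovery request failed request_id=req-123")
	})
	cleanup = withTestLogger(func() {
		Logger.Print("cleanup service stopped") // invented message
	})
	return discovery, cleanup
}

func main() {
	d, c := runSerially()
	fmt.Printf("discovery captured %q\n", d)
	fmt.Printf("cleanup captured %q\n", c)
}
```

Each buffer holds only its own line here. The trade-off is the usual one for shared global state: the two tests no longer overlap with the parallel tests in the package, in exchange for deterministic captures.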

Verification

  • go test ./internal/handlers/ -count=20 -run 'TestDiscoveryLoggingIncludesOptionalRequestID|StartStopBranches|TestExecution' — passes (this exact command reproduced the failure before the fix)
  • full go test ./internal/handlers/ — passes
  • go vet clean; no new gofmt drift introduced (the file has pre-existing struct-alignment drift on main, left untouched to keep this PR scoped)

Out of scope (flagged, not fixed here)

Running the package under -race surfaces several separate, pre-existing data races in the heartbeat coverage tests (processHeartbeatAsync async goroutines writing to test stubs in TestHeartbeatHandler_AdditionalCoverage / TestDiscoveryAndNodeHandlers_AdditionalCoverage). The coverage CI job does not run with -race, so those are not what breaks it — worth a separate follow-up.

🤖 Generated with Claude Code

…llel tests

TestDiscoveryLoggingIncludesOptionalRequestID and
TestExecutionCleanupService_StartStopBranches both call
setupExecutionCleanupTestLogger, which swaps the process-global
logger.Logger, while also calling t.Parallel(). When they ran alongside
another parallel test, a sibling reassigned the global logger mid-run, so
the discovery test's captured buffer came back empty and its assertions
("discovery request failed", `"request_id":"req-123"`, `"format":"compact"`)
failed. This broke the `coverage (control-plane)` job and cascaded into
`coverage-summary` (missing control-plane.total.txt).

The non-test build is unaffected, so it slipped past the required gates
and only showed up in the coverage job's full parallel run.

Drop t.Parallel() from both tests so they run in the serial phase, where
nothing else mutates the global logger concurrently -- matching the
existing non-parallel logger-swapping tests in execution_cleanup_test.go.
Verified by running the previously-failing command 20x clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@AbirAbbas AbirAbbas requested a review from a team as a code owner May 29, 2026 22:16
@github-actions
Contributor

📊 Coverage gate

Thresholds from .coverage-gate.toml: per-surface ≥ 86%, aggregate ≥ 88%, max per-surface regression ≤ 1.0 pp, max aggregate regression ≤ 0.50 pp.

| Surface | Current | Baseline | Δ |
| --- | --- | --- | --- |
| control-plane | 87.50% | 87.30% | ↑ +0.20 pp 🟡 |
| sdk-go | 92.00% | 90.70% | ↑ +1.30 pp 🟢 |
| sdk-python | 93.73% | 93.63% | ↑ +0.10 pp 🟢 |
| sdk-typescript | 92.80% | 92.56% | ↑ +0.24 pp 🟢 |
| web-ui | 89.93% | 90.01% | ↓ -0.08 pp 🟡 |
| aggregate | 89.03% | 89.01% | ↑ +0.02 pp 🟡 |

✅ Gate passed

No surface regressed past the allowed threshold and the aggregate stayed above the floor.
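For readers who want to check the numbers by hand, the thresholds line can be read as four comparisons. The sketch below encodes that reading; it is an assumption about how the gate combines the thresholds, not the actual gate script, and the type and function names are made up for illustration.

```go
package main

import "fmt"

// surface holds one row of the report, in percent.
type surface struct {
	name              string
	current, baseline float64
}

// thresholds mirrors the four values quoted from .coverage-gate.toml.
type thresholds struct {
	minSurface, minAggregate         float64 // floors, in percent
	maxSurfaceDrop, maxAggregateDrop float64 // allowed regression, in pp
}

// gatePasses applies an assumed reading of the rules: every surface must sit
// at or above its floor and must not drop more than the per-surface limit,
// and the aggregate must satisfy its own floor and regression limit.
func gatePasses(surfaces []surface, aggregate surface, th thresholds) bool {
	for _, s := range surfaces {
		if s.current < th.minSurface || s.baseline-s.current > th.maxSurfaceDrop {
			return false
		}
	}
	return aggregate.current >= th.minAggregate &&
		aggregate.baseline-aggregate.current <= th.maxAggregateDrop
}

// reported returns the rows and thresholds from the comment above.
func reported() ([]surface, surface, thresholds) {
	surfaces := []surface{
		{"control-plane", 87.50, 87.30},
		{"sdk-go", 92.00, 90.70},
		{"sdk-python", 93.73, 93.63},
		{"sdk-typescript", 92.80, 92.56},
		{"web-ui", 89.93, 90.01},
	}
	aggregate := surface{"aggregate", 89.03, 89.01}
	return surfaces, aggregate, thresholds{86, 88, 1.0, 0.50}
}

func main() {
	surfaces, aggregate, th := reported()
	fmt.Println("gate passed:", gatePasses(surfaces, aggregate, th))
}
```

Under this reading the only regression is web-ui at 0.08 pp, well inside the 1.0 pp allowance, so the check agrees with the reported result.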

@github-actions
Contributor

📐 Patch coverage gate

Threshold: 80% on lines this PR touches vs origin/main (from .coverage-gate.toml:thresholds.min_patch).

| Surface | Touched lines | Patch coverage | Status |
| --- | --- | --- | --- |
| control-plane | 0 | ➖ | no changes |
| sdk-go | 0 | ➖ | no changes |
| sdk-python | 0 | ➖ | no changes |
| sdk-typescript | 0 | ➖ | no changes |
| web-ui | 0 | ➖ | no changes |

✅ Patch gate passed

Every surface whose lines were touched by this PR has patch coverage at or above the threshold.

@AbirAbbas
Contributor Author

Folded into #602 — keeping it to a single PR. The coverage flake fix (drop t.Parallel() from the two global-logger-swapping tests) is now the first commit on that branch.

@AbirAbbas AbirAbbas closed this May 29, 2026
@AbirAbbas AbirAbbas deleted the fix/flaky-discovery-logging-global-logger-race branch May 29, 2026 22:32