
[codex] Stabilize Windows Bazel test flakes #17895

Closed
jgershen-oai wants to merge 13 commits into main from codex/fix-windows-marketplace-source

jgershen-oai commented Apr 15, 2026

Summary

This PR fixes and stabilizes several test areas that were making the Bazel workflow noisy. The first three commits address Windows test flakes; the later commits address the cross-platform app-server/TUI failures that surfaced while CI was running.

  1. Marketplace local source parsing

    • Treat platform-absolute marketplace sources as local paths so Windows drive-letter paths are not parsed as invalid git sources.
    • Add Windows-only coverage for backslash-relative and rooted local path forms.
    • Keep marketplace metadata/config tests on the production config-writing path so Windows paths are TOML-escaped the same way as real config updates.
  2. PowerShell-dependent timing tests

    • Add a test helper for building shell command responses from raw command strings so tests exercise the same command parsing and quoting path used on Windows.
    • Use that helper for the thread unsubscribe sleep test, where command parsing/quoting could otherwise affect Start-Sleep timing.
    • Add -NoProfile to PowerShell exec-test command vectors so user or runner profile startup cannot perturb short sleep/cancellation timing checks.
    • Run the MCP shell-approval test command as a non-login shell command and give the Windows PowerShell command more execution budget.
  3. Multi-agent and agent-control interrupt/history timing

    • Add a start handshake for the synthetic never-ending task before sending followup_task interrupt=true.
    • Avoid waiting for the child's CleanBackgroundTerminals bootstrap op to emit TurnStarted, since that op does not create a regular turn.
    • Give redirected-envelope and subagent-notification history assertions more time under loaded Bazel executors, and improve abort-wait diagnostics so future failures report unexpected aborted turns.
  4. App-server Bazel initialize/startup races

    • Cap the app-server integration test binary at --test-threads=2 under Bazel to reduce simultaneous codex-app-server child-process startup pressure.
    • Raise the app-server integration test default initialize/read budget from 10s to 30s, in response to the macOS/fs/app-list/command-exec failures that timed out before initialize returned under load.
    • Make the plugin-list fail-open test use a local mock ChatGPT base URL for featured-plugin warming instead of reaching real chatgpt.com.
    • Make proactive auth-refresh tests wait until startup refresh has actually hit the mock server before sending the request that expects the second refresh.
  5. TUI memory-mode test isolation

    • Keep the embedded app-server test's sqlite_home aligned with its temporary codex_home, so Bazel and Cargo both assert against the same isolated state DB.
    • Mark state DB backfill complete before starting the embedded app-server, matching the app-server memory-mode test setup and preventing async rollout backfill from racing the memory-mode write.
    • Poll for the state DB memory-mode update to land instead of racing the asynchronous persistence path.
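
The start handshake from item 3 can be sketched as follows. This is a minimal illustration that assumes nothing about the real test harness: `interrupt_after_start_handshake` is a hypothetical name, and it uses std threads and channels to stay dependency-free, where the actual tests may use different primitives. The point is only the ordering: the interrupt is sent after the task has confirmed it is running, not after an arbitrary sleep.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::mpsc;
use std::sync::Arc;
use std::thread;
use std::time::Duration;

/// Hypothetical sketch: runs a synthetic never-ending task and interrupts it
/// only after the task has signalled that it started. Returns true if the
/// task observed the interrupt and exited.
fn interrupt_after_start_handshake() -> bool {
    let (started_tx, started_rx) = mpsc::channel::<()>();
    let interrupted = Arc::new(AtomicBool::new(false));
    let task_flag = Arc::clone(&interrupted);

    let task = thread::spawn(move || {
        // Handshake: tell the test that the task is actually running.
        started_tx
            .send(())
            .expect("test should be waiting for the start signal");
        // "Never-ending" work: loop until an interrupt is observed.
        while !task_flag.load(Ordering::Acquire) {
            thread::sleep(Duration::from_millis(1));
        }
        true
    });

    // Wait for the start signal before interrupting, so the interrupt
    // cannot race the task's startup.
    started_rx
        .recv_timeout(Duration::from_secs(5))
        .expect("task should start within the budget");
    interrupted.store(true, Ordering::Release);

    task.join().expect("task thread should not panic")
}
```

Without the `recv_timeout` wait, the interrupt could be delivered before the task exists, which is the kind of ordering race the handshake is meant to remove.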

Root Cause

The original deterministic Windows marketplace failure came from the marketplace source parser only recognizing POSIX-style local path prefixes such as ./, ../, /, and ~/. A source like C:\\Users\\... fell through to the git source parser and failed with "invalid marketplace source format". A related config test manually interpolated Windows paths into TOML, which let backslashes be interpreted as TOML escapes.
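
A rough sketch of both ideas, under stated assumptions: `looks_like_local_source` and `toml_basic_string` are hypothetical names, not the functions in this PR, and the real parser may structure the check differently. The first function adds a platform-absolute check on top of the POSIX prefixes; the second shows why raw interpolation breaks, since a backslash must be doubled inside a TOML basic string. The PR itself keeps the tests on the production config-writing path instead of hand-escaping.

```rust
use std::path::Path;

/// Hypothetical helper: returns true when a marketplace source string should
/// be treated as a local path instead of being handed to the git parser.
fn looks_like_local_source(source: &str) -> bool {
    // POSIX-style prefixes that the original parser already recognized.
    const POSIX_PREFIXES: [&str; 4] = ["./", "../", "/", "~/"];
    if POSIX_PREFIXES.iter().any(|p| source.starts_with(*p)) {
        return true;
    }
    // Platform-absolute paths: on Windows this accepts drive-letter paths,
    // on Unix it accepts anything rooted at `/`.
    if Path::new(source).is_absolute() {
        return true;
    }
    // Windows-only forms: backslash-relative (`.\x`, `..\x`) and rooted (`\x`).
    cfg!(windows) && [".\\", "..\\", "\\"].iter().any(|p| source.starts_with(*p))
}

/// Hypothetical helper: quotes a value as a TOML basic string. Illustrative
/// only; it handles the common escapes and ignores other control characters.
fn toml_basic_string(value: &str) -> String {
    let mut out = String::with_capacity(value.len() + 2);
    out.push('"');
    for ch in value.chars() {
        match ch {
            '\\' => out.push_str("\\\\"),
            '"' => out.push_str("\\\""),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            other => out.push(other),
        }
    }
    out.push('"');
    out
}
```

On a non-Windows host a drive-letter string is not absolute, so the same input is classified differently per platform, which is why the added coverage is Windows-only.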

Separately, the slow/flaky Windows tests were timing-sensitive. Some tests built shell responses without going through the command parser used by Windows command execution, and the timing of short PowerShell commands can be distorted by quoting, login-shell profile startup, or too-small command execution budgets. The multi-agent/agent-control failures were test ordering races around asynchronous history writes.
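
For illustration, a command vector with profile loading disabled might be built like this. `powershell_argv` is a hypothetical helper, not the one added in this PR; `-NoProfile` and `-Command` are standard PowerShell command-line parameters.

```rust
/// Hypothetical helper: builds an argv for a short PowerShell test command.
fn powershell_argv(script: &str) -> Vec<String> {
    vec![
        "powershell.exe".to_string(),
        // Skip user and runner profile scripts so startup cost stays out of
        // short sleep/cancellation timing checks.
        "-NoProfile".to_string(),
        "-Command".to_string(),
        script.to_string(),
    ]
}
```

A test would pass the first element to `std::process::Command::new` and the rest to `.args(...)`; keeping the vector as data makes it easy to assert that `-NoProfile` is present.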

The latest macOS/Windows app-server failures all timed out waiting for JSON-RPC initialize responses across unrelated tests. That points to test-process startup contention under Bazel rather than fs/config/plugin behavior, so the targeted fixes reduce app-server integration-test concurrency and give startup/read handshakes a realistic remote-executor budget. One plugin-list test also leaked a real chatgpt.com request through featured-plugin cache warming, which made its fail-open behavior depend on external network/auth latency.
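
The effect of the read budget can be sketched with a channel standing in for the child process. Everything here is an assumption made for illustration: `wait_for_response` and `spawn_slow_responder` are hypothetical names, and the real tests read JSON-RPC messages from a `codex-app-server` child, not from an in-process channel.

```rust
use std::sync::mpsc::{self, RecvTimeoutError};
use std::thread;
use std::time::Duration;

/// Hypothetical stand-in for waiting on an initialize response: blocks until
/// a message arrives or the budget is spent.
fn wait_for_response(
    rx: &mpsc::Receiver<String>,
    budget: Duration,
) -> Result<String, RecvTimeoutError> {
    rx.recv_timeout(budget)
}

/// Simulates a child process that only answers after `startup_delay`, as a
/// loaded executor might.
fn spawn_slow_responder(startup_delay: Duration) -> mpsc::Receiver<String> {
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        thread::sleep(startup_delay);
        // The receiver may already have given up, so ignore a send error.
        let _ = tx.send("initialized".to_string());
    });
    rx
}
```

With a responder that starts slower than the budget, the wait times out even though nothing is wrong with the request itself, which matches the pattern of unrelated tests all failing at initialize.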

The TUI memory-mode failure was test isolation plus state backfill/persistence timing: the test moved codex_home to a temp directory but left sqlite_home on the helper's original configuration, started an embedded app-server without marking state backfill complete, and then immediately read state that is written asynchronously.
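
Polling for an asynchronous write, instead of reading once immediately, can be sketched as below. `poll_until` is a hypothetical helper written with std only; the test in this PR may use a different helper or an async equivalent.

```rust
use std::thread;
use std::time::{Duration, Instant};

/// Hypothetical helper: re-checks `condition` every `interval` until it
/// holds or `budget` elapses. Returns whether the condition was observed.
fn poll_until(
    budget: Duration,
    interval: Duration,
    mut condition: impl FnMut() -> bool,
) -> bool {
    let deadline = Instant::now() + budget;
    loop {
        if condition() {
            return true;
        }
        if Instant::now() >= deadline {
            return false;
        }
        thread::sleep(interval);
    }
}
```

In a test, `condition` would be a closure that reads the state DB and checks for the expected memory mode, so the assertion waits for the persistence path instead of racing it.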

Validation

  • just fmt
  • cargo test -p codex-core marketplace_add
  • cargo test -p codex-cli marketplace_add
  • cargo test -p codex-app-server suite::v2::thread_unsubscribe::thread_unsubscribe_during_turn_keeps_turn_running -- --exact
  • cargo test -p codex-app-server suite::v2::plugin_list::plugin_list_force_remote_sync_returns_remote_sync_error_on_fail_open -- --exact
  • cargo test -p codex-app-server proactive_refresh -- --test-threads=2
  • cargo test -p codex-app-server -- --test-threads=2
  • cargo clippy -p codex-app-server --tests -- -D warnings
  • cargo test -p codex-core tools::handlers::multi_agents::tests::multi_agent_v2_followup_task_interrupts_busy_child_without_losing_message -- --exact
  • cargo test -p codex-core agent::control::tests::spawn_child_completion_notifies_parent_history -- --exact
  • cargo test -p codex-core exec_full_buffer_capture_ignores_expiration
  • cargo test -p codex-core process_exec_tool_call_preserves_full_buffer_capture_policy
  • cargo test -p codex-core process_exec_tool_call_respects_cancellation_token
  • cargo test -p codex-mcp-server suite::codex_tool::test_shell_command_approval_triggers_elicitation -- --exact
  • cargo test -p codex-mcp-server
  • cargo test -p codex-tui app::tests::update_memory_settings_updates_current_thread_memory_mode -- --exact
  • git diff --check
  • bazel query //codex-rs/app-server:app-server-all-test
  • bazel build --config=argument-comment-lint -- //codex-rs/app-server/tests/common:common //codex-rs/app-server/tests/common:common-unit-tests-bin

Notes:

  • just fix -p codex-app-server was attempted, but this local environment denied Cargo's TCP lock listener before Clippy started. The non-fixing scoped Clippy check above passed.
  • The branch has been rebased onto current main, which includes the cargo-deny dependency update.
  • Local Bazel execution on this machine is currently blocked by the local Bazel/Rust toolchain setup, so the next meaningful full Bazel signal should come from remote CI.

jgershen-oai changed the title from "[codex] Fix Windows marketplace local source parsing" to "[codex] Stabilize Windows Bazel test flakes" Apr 15, 2026
jgershen-oai force-pushed the codex/fix-windows-marketplace-source branch 2 times, most recently from 318e5e4 to 5bdbfc0 April 15, 2026 19:38
jgershen-oai force-pushed the codex/fix-windows-marketplace-source branch from 49525ab to 1068a41 April 15, 2026 20:22
jgershen-oai marked this pull request as ready for review April 15, 2026 21:29
@jgershen-oai

Closing this broad CI-stabilization PR in favor of a narrower extraction stacked on #18000: Windows filepath handling plus shell command string construction only.
