[codex] Stabilize Windows Bazel test flakes by jgershen-oai · Pull Request #17895 · openai/codex

jgershen-oai · 2026-04-15T05:36:00Z

Summary

This PR fixes and stabilizes several test areas that were making the Bazel workflow noisy. The first three commits address Windows test flakes; the later commits address the cross-platform app-server/TUI failures that surfaced while CI was running.

Marketplace local source parsing
- Treat platform-absolute marketplace sources as local paths so Windows drive-letter paths are not parsed as invalid git sources.
- Add Windows-only coverage for backslash-relative and rooted local path forms.
- Keep marketplace metadata/config tests on the production config-writing path so Windows paths are TOML-escaped the same way as real config updates.
PowerShell-dependent timing tests
- Add a test helper for building shell command responses from raw command strings so tests exercise the same command parsing and quoting path used on Windows.
- Use that helper for the thread unsubscribe sleep test, where command parsing/quoting could otherwise affect Start-Sleep timing.
- Add -NoProfile to PowerShell exec-test command vectors so user or runner profile startup cannot perturb short sleep/cancellation timing checks.
- Run the MCP shell-approval test command as a non-login shell command and give the Windows PowerShell command more execution budget.
Multi-agent and agent-control interrupt/history timing
- Add a start handshake for the synthetic never-ending task before sending followup_task interrupt=true.
- Avoid waiting for the child's CleanBackgroundTerminals bootstrap op to emit TurnStarted, since that op does not create a regular turn.
- Give redirected-envelope and subagent-notification history assertions more time under loaded Bazel executors, and improve abort-wait diagnostics so future failures report unexpected aborted turns.
App-server Bazel initialize/startup races
- Cap the app-server integration test binary at --test-threads=2 under Bazel to reduce simultaneous codex-app-server child-process startup pressure.
- Raise the app-server integration test default initialize/read budget from 10s to 30s, matching the macOS/fs/app-list/command-exec failures that timed out before initialize returned under load.
- Make the plugin-list fail-open test use a local mock ChatGPT base URL for featured-plugin warming instead of reaching real chatgpt.com.
- Make proactive auth-refresh tests wait until startup refresh has actually hit the mock server before sending the request that expects the second refresh.
TUI memory-mode test isolation
- Keep the embedded app-server test's sqlite_home aligned with its temporary codex_home, so Bazel and Cargo both assert against the same isolated state DB.
- Mark state DB backfill complete before starting the embedded app-server, matching the app-server memory-mode test setup and preventing async rollout backfill from racing the memory-mode write.
- Poll for the state DB memory-mode update to land instead of racing the asynchronous persistence path.
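
To make the `-NoProfile` bullet above concrete, here is a minimal sketch of how a test command vector for a short PowerShell sleep might be built. The helper name `powershell_sleep_argv` and the choice of `powershell.exe` as the executable are assumptions for illustration; they are not taken from this PR's diff.

```rust
/// Hypothetical helper (not from the PR): builds an argv for a short
/// PowerShell sleep used by a timing-sensitive test.
fn powershell_sleep_argv(seconds: u32) -> Vec<String> {
    vec![
        "powershell.exe".to_string(),
        // Skip profile scripts so their startup cost cannot eat into a
        // short sleep or cancellation window.
        "-NoProfile".to_string(),
        "-Command".to_string(),
        format!("Start-Sleep -Seconds {seconds}"),
    ]
}
```

Placing `-NoProfile` before `-Command` matters because PowerShell treats everything after `-Command` as the command text.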

Root Cause

The original deterministic Windows marketplace failure came from the marketplace source parser recognizing only POSIX-style local path prefixes such as ./, ../, /, and ~/. A source like C:\\Users\\... therefore fell through to the git source parser and failed with an "invalid marketplace source format" error. A related config test interpolated Windows paths into TOML by hand, so the backslashes in those paths were interpreted as TOML escape sequences.
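
The two failure modes in this paragraph can be sketched in a few lines. This is an illustration under stated assumptions, not the parser or config writer from this PR: `looks_like_local_path` and `toml_basic_string` are hypothetical names, and the classifier checks string prefixes on every platform so that it runs anywhere, whereas the real change is described as treating platform-absolute paths as local.

```rust
/// Returns true when `source` starts with any of the given prefixes.
fn starts_with_any(source: &str, prefixes: &[&str]) -> bool {
    prefixes.iter().any(|p| source.starts_with(*p))
}

/// Hypothetical classifier: true when a marketplace source string should be
/// treated as a local path instead of being handed to a git source parser.
fn looks_like_local_path(source: &str) -> bool {
    // POSIX-style prefixes the description says were already recognized.
    const POSIX_PREFIXES: [&str; 4] = ["./", "../", "/", "~/"];
    // Backslash-relative and rooted Windows forms.
    const WINDOWS_PREFIXES: [&str; 3] = [".\\", "..\\", "\\"];
    if starts_with_any(source, &POSIX_PREFIXES) || starts_with_any(source, &WINDOWS_PREFIXES) {
        return true;
    }
    // Drive-letter absolute paths: a letter, a colon, then a separator.
    let bytes = source.as_bytes();
    bytes.len() >= 3
        && bytes[0].is_ascii_alphabetic()
        && bytes[1] == b':'
        && (bytes[2] == b'\\' || bytes[2] == b'/')
}

/// Hypothetical helper: renders a value as a TOML basic (double-quoted)
/// string, escaping backslashes so a Windows path survives a round trip.
fn toml_basic_string(value: &str) -> String {
    let mut out = String::with_capacity(value.len() + 2);
    out.push('"');
    for ch in value.chars() {
        match ch {
            '\\' => out.push_str("\\\\"),
            '"' => out.push_str("\\\""),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            c if c.is_control() => out.push_str(&format!("\\u{:04X}", c as u32)),
            c => out.push(c),
        }
    }
    out.push('"');
    out
}
```

Hand-escaping like this is only meant to show why raw interpolation breaks. The PR itself keeps the tests on the production config-writing path, which avoids maintaining a second escaping routine in test code.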

Separately, the slow/flaky Windows tests were timing-sensitive. Some tests built shell responses without going through the command parser used by Windows command execution, and short PowerShell commands can be distorted by quoting, login-shell profile startup, or too-small command execution budgets. The multi-agent/agent-control failures were test ordering races around asynchronous history writes.
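
The start handshake used for the multi-agent interrupt test can be illustrated with plain threads and a channel. This is a sketch only: the real tests are asynchronous and use the project's own task types, and every name below (`spawn_busy_task`, `interrupt_after_start`) is hypothetical.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::mpsc;
use std::sync::Arc;
use std::thread;
use std::time::Duration;

/// Hypothetical stand-in for the synthetic never-ending task: it announces
/// that it has started, then spins until interrupted.
fn spawn_busy_task(
    started: mpsc::Sender<()>,
    interrupted: Arc<AtomicBool>,
) -> thread::JoinHandle<&'static str> {
    thread::spawn(move || {
        // Handshake: tell the test that the task is really running.
        // A failed send only means the test stopped listening.
        let _ = started.send(());
        while !interrupted.load(Ordering::Acquire) {
            thread::sleep(Duration::from_millis(1));
        }
        "interrupted"
    })
}

/// Waits for the start signal before interrupting, so the interrupt cannot
/// arrive before the task exists.
fn interrupt_after_start(timeout: Duration) -> Result<&'static str, String> {
    let (started_tx, started_rx) = mpsc::channel();
    let interrupted = Arc::new(AtomicBool::new(false));
    let handle = spawn_busy_task(started_tx, Arc::clone(&interrupted));

    let started = started_rx.recv_timeout(timeout);
    // Always release the task, even if the handshake timed out, so the
    // worker thread is not leaked.
    interrupted.store(true, Ordering::Release);
    let outcome = handle.join().map_err(|_| "task panicked".to_string())?;
    started.map_err(|err| format!("task never reported start: {err}"))?;
    Ok(outcome)
}
```

Without the `recv_timeout` wait, the interrupt could be issued before the task has begun, which is the kind of ordering race described above.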

The latest macOS/Windows app-server failures all timed out waiting for JSON-RPC initialize responses across unrelated tests. That points at test-process startup contention under Bazel rather than fs/config/plugin behavior, so the targeted fixes reduce app-server integration-test concurrency and give startup/read handshakes a realistic remote-executor budget. One plugin-list test also leaked a real chatgpt.com request through featured-plugin cache warming, which made its fail-open behavior depend on external network/auth latency.
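
As a rough sketch of what a startup/read budget looks like, the helper below waits for one message with an explicit timeout and reports which budget was exhausted. It uses a standard-library channel in place of the real JSON-RPC stream; `DEFAULT_READ_TIMEOUT` and `read_with_budget` are assumed names, and only the 30s value comes from the description.

```rust
use std::sync::mpsc::{Receiver, RecvTimeoutError};
use std::time::Duration;

/// Assumed constant name; the 30s value mirrors the raised budget.
const DEFAULT_READ_TIMEOUT: Duration = Duration::from_secs(30);

/// Hypothetical reader: waits up to `budget` for the next message and
/// distinguishes a slow server from one that has gone away.
fn read_with_budget(rx: &Receiver<String>, budget: Duration) -> Result<String, String> {
    match rx.recv_timeout(budget) {
        Ok(message) => Ok(message),
        Err(RecvTimeoutError::Timeout) => Err(format!("no response within {budget:?}")),
        Err(RecvTimeoutError::Disconnected) => {
            Err("server closed the stream before responding".to_string())
        }
    }
}
```

A larger budget only raises the ceiling for loaded executors; a response that arrives quickly still returns immediately.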

The TUI memory-mode failure was test isolation plus state backfill/persistence timing: the test moved codex_home to a temp directory but left sqlite_home on the helper's original configuration, started an embedded app-server without marking state backfill complete, and then immediately read state that is written asynchronously.
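
Polling for an asynchronously persisted value, rather than reading it once, might look like the following. This is a generic sketch with a hypothetical `poll_until` helper; the actual test reads the state DB through the project's own APIs.

```rust
use std::thread;
use std::time::{Duration, Instant};

/// Hypothetical polling helper: calls `probe` until it yields a value or the
/// deadline passes. Returns `None` on timeout.
fn poll_until<T>(
    timeout: Duration,
    interval: Duration,
    mut probe: impl FnMut() -> Option<T>,
) -> Option<T> {
    let deadline = Instant::now() + timeout;
    loop {
        if let Some(value) = probe() {
            return Some(value);
        }
        if Instant::now() >= deadline {
            return None;
        }
        thread::sleep(interval);
    }
}
```

The probe runs at least once even with a zero timeout, so a value that is already present is never missed.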

Validation

just fmt
cargo test -p codex-core marketplace_add
cargo test -p codex-cli marketplace_add
cargo test -p codex-app-server suite::v2::thread_unsubscribe::thread_unsubscribe_during_turn_keeps_turn_running -- --exact
cargo test -p codex-app-server suite::v2::plugin_list::plugin_list_force_remote_sync_returns_remote_sync_error_on_fail_open -- --exact
cargo test -p codex-app-server proactive_refresh -- --test-threads=2
cargo test -p codex-app-server -- --test-threads=2
cargo clippy -p codex-app-server --tests -- -D warnings
cargo test -p codex-core tools::handlers::multi_agents::tests::multi_agent_v2_followup_task_interrupts_busy_child_without_losing_message -- --exact
cargo test -p codex-core agent::control::tests::spawn_child_completion_notifies_parent_history -- --exact
cargo test -p codex-core exec_full_buffer_capture_ignores_expiration
cargo test -p codex-core process_exec_tool_call_preserves_full_buffer_capture_policy
cargo test -p codex-core process_exec_tool_call_respects_cancellation_token
cargo test -p codex-mcp-server suite::codex_tool::test_shell_command_approval_triggers_elicitation -- --exact
cargo test -p codex-mcp-server
cargo test -p codex-tui app::tests::update_memory_settings_updates_current_thread_memory_mode -- --exact
git diff --check
bazel query //codex-rs/app-server:app-server-all-test
bazel build --config=argument-comment-lint -- //codex-rs/app-server/tests/common:common //codex-rs/app-server/tests/common:common-unit-tests-bin

Notes:

just fix -p codex-app-server was attempted, but this local environment denied Cargo's TCP lock listener before Clippy started. The non-fixing scoped Clippy check above passed.
The branch has been rebased onto current main, which includes the cargo-deny dependency update.
Local Bazel execution on this machine is currently blocked by the local Bazel/Rust toolchain setup, so the next meaningful full Bazel signal should come from remote CI.

jgershen-oai · 2026-04-15T22:32:06Z

Closing this broad CI-stabilization PR in favor of a narrower extraction stacked on #18000: Windows filepath handling plus shell command string construction only.

jgershen-oai changed the title ~~[codex] Fix Windows marketplace local source parsing~~ [codex] Stabilize Windows Bazel test flakes Apr 15, 2026

jgershen-oai force-pushed the codex/fix-windows-marketplace-source branch 2 times, most recently from 318e5e4 to 5bdbfc0, April 15, 2026 19:38

jgershen-oai added 13 commits April 15, 2026 13:21

- 6703333 Fix Windows marketplace local source parsing
- d61d313 Stabilize Windows thread unsubscribe sleep test
- 6a3962f Fix marketplace config test and argument lint
- 7e47e85 Stabilize multi-agent interrupt test
- a3719d9 Use NoProfile in PowerShell exec tests
- a04370f Fix multi-agent interrupt test startup wait
- c7d99c6 Stabilize Bazel test timing under load
- b0bea02 Stabilize remaining Bazel timing flakes
- 63230bc Fix app-server argument lint
- 5676b84 Fix TUI memory mode test isolation
- 55b2274 Wait for TUI memory mode persistence
- 5976ceb Stabilize app-server CI timeouts
- 1068a41 Stabilize TUI memory mode state test

jgershen-oai force-pushed the codex/fix-windows-marketplace-source branch from 49525ab to 1068a41, April 15, 2026 20:22

jgershen-oai marked this pull request as ready for review April 15, 2026 21:29

jgershen-oai requested review from aibrahim-oai and jif-oai April 15, 2026 21:30

jgershen-oai closed this Apr 15, 2026

jgershen-oai mentioned this pull request Apr 15, 2026

Fix Windows marketplace paths and shell command strings #18019

Draft

jgershen-oai wants to merge 13 commits into main from codex/fix-windows-marketplace-source
