ci: cap parallel-ctest-containers concurrency to fix self-hosted NP/LR test failures #2

Merged: igorls merged 2 commits into main from ci/throttle-ctest-concurrency on Jun 7, 2026

@igorls igorls commented Jun 7, 2026

Problem

parallel-ctest-containers launched a Docker container for every discovered test at once (Promise.all over all of them). The nonparallelizable_tests / long_running_tests suites are heavy multi-nodeos integration tests, so on a self-hosted runner this starved CPU, RAM, and ports and made them fail en masse. Even trivial tests like get_account_test failed. These failures are starvation, not real bugs: the tests pass on ENF's sized runners, and the code is the live-Jungle4-validated 1.2.x harvest.

Fix

Replace the unbounded fan-out with a bounded worker pool: at most N test containers run concurrently (N = CTEST_CONTAINER_CONCURRENCY env, default 4). results[i] stays aligned with tests[i], so the failure-log extraction is unchanged. Rebuilt dist/index.mjs via ncc; verified it loads + runs.
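
For readers unfamiliar with the pattern, here is a minimal sketch of a bounded worker pool with index-aligned results. It is an illustration under assumptions, not the action's actual code: `runBounded` and its `tasks`/`limit` parameters are hypothetical names, and each task stands in for one "run this test container" step.

```javascript
// Hypothetical helper: run `tasks` (functions returning promises) with at most
// `limit` of them in flight, keeping results[i] aligned with tasks[i].
async function runBounded(tasks, limit) {
  const results = new Array(tasks.length);
  let next = 0; // shared cursor; safe because workers only interleave at `await`

  async function worker() {
    while (next < tasks.length) {
      const i = next++;              // claim the next unstarted task
      results[i] = await tasks[i](); // store by index, not by completion order
    }
  }

  // Start no more workers than there are tasks.
  const workerCount = Math.min(Math.max(1, limit), tasks.length);
  await Promise.all(Array.from({ length: workerCount }, () => worker()));
  return results;
}
```

Because each worker writes to the slot it claimed, a consumer that walks `results` alongside the original list sees the same alignment it had with a single Promise.all over every task.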

Validation (dispatched run on this branch, 12c/32G self-hosted runner)

  • NP/LR Tests: 12/12 pass — previously 0/16 (all failed under the old fan-out).
  • Platform-cache, all builds, all package jobs (incl. the distutils-fixed ubuntu24/26): green.
  • Final: 41/43 green. The only real failure is wasm_config_part1_unit_test_eos-vm-oc (Tests (asserton)) — a pre-existing, isolated OC unit-test issue consistent across every run, unrelated to this change (tracked separately).

Trade-off: capping concurrency lengthens the NP/LR wall-clock (heavy suites run 4-at-a-time). Tunable per-tier via the env var without a rebuild if needed.

The action launched one docker container per test ALL AT ONCE (Promise.all
over every discovered test). For the nonparallelizable_tests / long_running_tests
suites those are heavy multi-nodeos integration tests, so on a modest self-hosted
runner they starve CPU/RAM/ports and fail en masse (even trivial tests like
get_account_test fail — starvation, not real bugs; they pass on ENF's sized
runners). Replace the unbounded fan-out with a bounded worker pool: at most N
containers run concurrently (N = CTEST_CONTAINER_CONCURRENCY env, default 4).
results[i] stays aligned with tests[i] so failure-log extraction is unchanged.
Rebuilt dist/index.mjs via ncc; verified it loads + runs.
Copilot AI review requested due to automatic review settings June 7, 2026 19:00

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces bounded concurrency for running test containers in parallel, replacing the previous behavior of launching all containers at once. This prevents resource starvation on self-hosted runners by capping the concurrent executions to a default of 4 or a value specified by the CTEST_CONTAINER_CONCURRENCY environment variable. The review feedback suggests refactoring the unconventional for loop in the worker function to a more idiomatic while loop to improve code readability.

Comment on lines +48 to +55
      for(let i = next_test++; i < tests.length; i = next_test++) {
         const t = tests[i];
         // Clear any orphaned container of this name before reusing it (see note above).
         child_process.spawnSync("docker", ["rm", "-f", t.name], {stdio:"ignore"});
         results[i] = await new Promise(resolve => {
            child_process.spawn("docker", ["run", "--security-opt", "seccomp=unconfined", "-e", "GITHUB_ACTIONS=True", "--name", t.name, "--init", "baseimage", "bash", "-c", `cd build; ctest --output-on-failure -R '^${t.name}$' --timeout ${test_timeout}`], {stdio:"inherit"}).on('close', code => resolve(code));
         });
      }

medium

The for loop with next_test++ in both the initialization and the increment expression is highly unconventional and can be difficult to read and reason about. Using a standard while loop is much more idiomatic and improves code readability.

      while (next_test < tests.length) {
         const i = next_test++;
         const t = tests[i];
         // Clear any orphaned container of this name before reusing it (see note above).
         child_process.spawnSync("docker", ["rm", "-f", t.name], {stdio:"ignore"});
         results[i] = await new Promise(resolve => {
            child_process.spawn("docker", ["run", "--security-opt", "seccomp=unconfined", "-e", "GITHUB_ACTIONS=True", "--name", t.name, "--init", "baseimage", "bash", "-c", `cd build; ctest --output-on-failure -R '^${t.name}$' --timeout ${test_timeout}`], {stdio:"inherit"}).on('close', code => resolve(code));
         });
      }

Copilot AI left a comment

Pull request overview

Caps the parallel-ctest-containers GitHub Action’s test-container execution fan-out by introducing a bounded worker pool, preventing self-hosted runners from being overwhelmed when running heavy NP/LR integration suites.

Changes:

  • Replace unbounded Promise.all container launches with a bounded worker-pool runner.
  • Add CTEST_CONTAINER_CONCURRENCY (default 4) to control maximum concurrent test containers.

Comment on lines +42 to +43
const max_concurrency = Math.max(1, parseInt(process.env.CTEST_CONTAINER_CONCURRENCY, 10) || 4);
console.log(`Running ${tests.length} '${tests_label}' test(s), up to ${max_concurrency} container(s) at a time`);
Comment on lines +52 to +54
         results[i] = await new Promise(resolve => {
            child_process.spawn("docker", ["run", "--security-opt", "seccomp=unconfined", "-e", "GITHUB_ACTIONS=True", "--name", t.name, "--init", "baseimage", "bash", "-c", `cd build; ctest --output-on-failure -R '^${t.name}$' --timeout ${test_timeout}`], {stdio:"inherit"}).on('close', code => resolve(code));
         });
- use a plain while loop instead of the unconventional for(;;) with
  next_test++ in both clauses (readability; Gemini)
- parse CTEST_CONTAINER_CONCURRENCY with an explicit NaN check so a
  configured 0 clamps to 1 instead of silently falling back to 4 (Copilot)
- add an 'error' handler to the docker spawn so a spawn failure resolves
  the worker as failed instead of hanging the pool (Copilot)

Rebuilt dist/index.mjs; behaviour on the happy path is unchanged.
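
To make the parsing difference concrete, the sketch below contrasts the original `parseInt(...) || 4` expression with an explicit NaN check. The function names `legacyConcurrency` and `parseConcurrency` are hypothetical and exist only for this comparison; the expressions inside them follow the ones quoted in this PR.

```javascript
// Original form: `0 || 4` evaluates to 4, so a configured 0 silently becomes 4.
function legacyConcurrency(raw) {
  return Math.max(1, parseInt(raw, 10) || 4);
}

// Explicit NaN check: only an unset or non-numeric value falls back to the default.
function parseConcurrency(raw, fallback = 4) {
  const parsed = parseInt(raw, 10);
  return Number.isNaN(parsed) ? fallback : Math.max(1, parsed);
}

console.log(legacyConcurrency("0"));      // 4 (falls back)
console.log(parseConcurrency("0"));       // 1 (clamped)
console.log(parseConcurrency(undefined)); // 4 (unset, default)
console.log(parseConcurrency("8"));       // 8
```

In the action itself the input would be `process.env.CTEST_CONTAINER_CONCURRENCY`; the sketch takes the raw string as a parameter only so it can be exercised without touching the environment.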

igorls commented Jun 7, 2026

Thanks for the review — all three points addressed in 70aad08:

  • while loop (Gemini): replaced the for(let i = next_test++; …; i = next_test++) with a plain while (next_test < tests.length) { const i = next_test++; … }.
  • CTEST_CONTAINER_CONCURRENCY parsing (Copilot): now Number.isNaN(parsed) ? 4 : Math.max(1, parsed), so a configured 0 clamps to 1 instead of silently falling back to 4.
  • spawn error handler (Copilot): the worker's Promise now resolves to a non-zero code on the error event, so a docker-spawn failure marks the test failed instead of hanging the pool.

Happy-path behaviour is unchanged (the validated run had concurrency 4, NP/LR 12/12). Rebuilt dist/index.mjs via ncc and verified it loads + runs.
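
The spawn error handling can be sketched roughly as follows. This is not the merged code: `runToExitCode` is a hypothetical helper, the failure code `1` is an assumed choice (the PR only says "non-zero"), and treating a `null` close code (process killed by a signal) as a failure is also an assumption made for this example.

```javascript
const { spawn } = require("node:child_process");

// Hypothetical helper: resolve with the child's exit code, and never hang
// if the process cannot be started at all (for example, binary not found).
function runToExitCode(command, args) {
  return new Promise(resolve => {
    const child = spawn(command, args, { stdio: "ignore" });
    // 'error' fires when the process could not be spawned. Without this
    // handler the promise would depend entirely on 'close' being emitted.
    child.on("error", () => resolve(1));
    // A promise settles only once, so a 'close' that arrives after 'error'
    // is ignored. A null code (signal kill) is mapped to 1 by assumption.
    child.on("close", code => resolve(code ?? 1));
  });
}
```

With a helper of this shape, a worker awaiting the promise always gets a number back, so a failed spawn is recorded as a failed test and the worker moves on to the next one.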

@igorls igorls merged commit 3887dfe into main Jun 7, 2026
1 check passed
@igorls igorls deleted the ci/throttle-ctest-concurrency branch June 7, 2026 22:59