[NO-REVIEW] Batch WASM CoreCLR library test suites on Helix #126157
radekdoulik wants to merge 3 commits into main
Pull request overview
This PR aims to reduce Helix queue pressure for WASM CoreCLR library testing by batching many individual test-suite work items into a smaller number of larger work items, using a batch runner to execute multiple suites sequentially with isolated result uploads.
Changes:
- Add a WASM batch runner script to unzip and run multiple test suites sequentially inside one Helix work item.
- Extend `sendtohelix-browser.targets` to optionally generate batched Helix work items via an MSBuild bin-packing step and per-batch timeout computation.
- Adjust browser/CoreCLR Helix and xharness timeouts, and update the browser/CoreCLR test exclusion list.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| `src/libraries/tests.proj` | Updates the browser/CoreCLR disabled-test list (significant exclusion removals). |
| `src/libraries/sendtohelixhelp.proj` | Increases default Helix work item timeout for browser/CoreCLR. |
| `src/libraries/sendtohelix-browser.targets` | Adds batching mode, grouping/timeout tasks, and a new target to emit batched Helix work items. |
| `eng/testing/tests.wasm.targets` | Increases xharness timeout default for CoreCLR WASM test runs. |
| `eng/testing/WasmBatchRunner.sh` | New script to run multiple suite zips in one work item with per-suite upload directories. |
Reduce Helix queue pressure by grouping ~172 individual WASM CoreCLR library test work items into ~23 batched work items (87% reduction).

Changes:
- Add eng/testing/WasmBatchRunner.sh: batch runner that extracts and runs multiple test suites sequentially within a single work item, with per-suite result isolation
- Add greedy bin-packing inline MSBuild task (_GroupWorkItems) that distributes test archives into balanced batches by file size
- Add _AddBatchedWorkItemsForLibraryTests target gated on WasmBatchLibraryTests property (defaults true for CoreCLR+Chrome)
- Sample apps excluded from batching, kept as individual work items
- Can be disabled with /p:WasmBatchLibraryTests=false

Expected impact:
- 172 → ~23 Helix work items (87% queue pressure reduction)
- ~6% machine time savings (~26 minutes)
- Longest batch ~18 minutes (well-balanced bin-packing)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Remove unused EXECUTION_DIR variable from WasmBatchRunner.sh
- Use PayloadArchive (ZIP) instead of PayloadDirectory to pass sendtohelixhelp.proj validation
- Use HelixCommand with RunTests.sh→WasmBatchRunner.sh substitution to preserve env var setup and pre-commands

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
cf67805 to
a751466
Batch--1 (1 item) and Batch-5 (8 items) timed out in CI because the 2min/suite formula was too aggressive. System.IO.Compression alone takes 11m, System.Security.Cryptography takes 17m, and Microsoft.Bcl.Memory takes 6m. With 19/21 batches passing and the longest at 17m24s, a 30m minimum provides adequate headroom.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
```csharp
// 20 minutes per suite to account for WASM startup overhead + test execution;
// minimum 30 minutes to handle the heaviest individual suites (e.g. Cryptography ~17m)
int totalMinutes = Math.Max(30, count * 20);
var ts = TimeSpan.FromMinutes(totalMinutes);

var helixItem = new TaskItem(ItemPrefix + "Batch-" + bid);
helixItem.SetMetadata("BatchDir", BatchOutputDir + "batch-" + bid + "/");
helixItem.SetMetadata("Timeout", ts.ToString(@"hh\:mm\:ss"));
result.Add(helixItem);
```
The per-batch timeout logic is inconsistent with the comment immediately below and the PR description: this task currently computes totalMinutes = Math.Max(30, count * 20) (20 min/suite, 30 min minimum), but the target comment says "2 minutes per suite, minimum 10 minutes". Please align the code and comments (and ensure the resulting timeout is appropriate for the longest WASM suites) to avoid unexpectedly huge timeouts or unintended work item timeouts.
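To make the gap between the two formulas concrete, here is a minimal bash sketch of the same arithmetic. It is illustrative only: the PR computes the timeout in a C# inline task, and `batch_timeout` is a hypothetical helper, not something in the PR. The sketch assumes totals under 24 hours.

```bash
#!/usr/bin/env bash
# Illustrative only: mirrors Math.Max(minimum, count * per_suite) from the task.
# batch_timeout is a hypothetical helper name, not part of the PR.
batch_timeout() {
    local count=$1 per_suite=$2 minimum=$3
    local minutes=$(( count * per_suite ))
    if (( minutes < minimum )); then
        minutes=$minimum
    fi
    # Print as hh:mm:ss (assumes the total stays under 24 hours).
    printf '%02d:%02d:00\n' $(( minutes / 60 )) $(( minutes % 60 ))
}

batch_timeout 1 20 30   # code as written, 1 suite:  00:30:00
batch_timeout 8 20 30   # code as written, 8 suites: 02:40:00
batch_timeout 8 2 10    # formula in the stale comment, 8 suites: 00:16:00
```

For an 8-suite batch the two formulas differ by a factor of ten, which is why the comment and the code need to agree.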
```xml
<!-- Stage each batch: copy ZIPs and the runner script into a per-batch directory -->
<MakeDir Directories="$(IntermediateOutputPath)helix-batches/batch-%(_WasmUniqueBatchId.Identity)/" />
<Copy SourceFiles="@(_WasmGroupedItem)" DestinationFolder="$(IntermediateOutputPath)helix-batches/batch-%(BatchId)/" />
<Copy SourceFiles="$(RepositoryEngineeringDir)testing/WasmBatchRunner.sh"
      DestinationFolder="$(IntermediateOutputPath)helix-batches/batch-%(_WasmUniqueBatchId.Identity)/" />

<!-- Compute per-batch timeout: 2 minutes per suite, minimum 10 minutes -->
<_ComputeBatchTimeout GroupedItems="@(_WasmGroupedItem)" BatchIds="@(_WasmUniqueBatchId)"
                      ItemPrefix="$(WorkItemPrefix)" BatchOutputDir="$(IntermediateOutputPath)helix-batches/">
  <Output TaskParameter="TimedItems" ItemName="_WasmTimedBatchItem" />
</_ComputeBatchTimeout>

<!-- Create ZIP archives from batch directories (sendtohelixhelp.proj requires PayloadArchive) -->
<ZipDirectory SourceDirectory="%(_WasmTimedBatchItem.BatchDir)"
              DestinationFile="$(IntermediateOutputPath)helix-batches/%(_WasmTimedBatchItem.Identity).zip"
              Overwrite="true" />
```
The batch staging directory under $(IntermediateOutputPath)helix-batches/ is only created/copied into, never cleaned. On incremental builds, stale ZIPs from a previous run can remain in batch-* directories and get re-zipped into the payload, causing unexpected extra suites to run. Consider deleting $(IntermediateOutputPath)helix-batches/ (or each batch-* directory) before copying, or otherwise ensuring the batch directories are empty before zipping.
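The stale-file problem and the suggested fix can be shown with a small bash sketch. This is an assumption-laden illustration, not the PR's code: staging in the PR is done with MSBuild tasks (where a `RemoveDir` step before `MakeDir` would presumably play the same role), and `stage_batch` and the file names below are hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helper: stage suite ZIPs into a batch directory, starting from empty.
stage_batch() {
    local batch_dir=$1
    shift
    # Remove anything left over from a previous run before staging.
    rm -rf -- "$batch_dir"
    mkdir -p -- "$batch_dir"
    cp -- "$@" "$batch_dir"/
}

work=$(mktemp -d)
touch "$work/a.zip" "$work/b.zip"

stage_batch "$work/batch-0" "$work/a.zip" "$work/b.zip"   # first build: two suites
stage_batch "$work/batch-0" "$work/a.zip"                 # incremental build: one suite
ls "$work/batch-0"                                        # only a.zip remains
```

Without the `rm -rf` step, `b.zip` from the first run would still be in `batch-0` and would be zipped into the second payload.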
```bash
echo ""
echo "Total: $SUITE_COUNT | Passed: $((SUITE_COUNT - FAIL_COUNT)) | Failed: $FAIL_COUNT"

if [[ $FAIL_COUNT -ne 0 ]]; then
    exit 1
fi

exit 0
```
This script leaves HELIX_WORKITEM_UPLOAD_ROOT set to the last suite’s subdirectory when it exits. Helix post-commands (e.g., the CoreCLR dump-doc generation in sendtohelixhelp.proj) may run after the main command and use HELIX_WORKITEM_UPLOAD_ROOT to decide where to write artifacts; consider restoring HELIX_WORKITEM_UPLOAD_ROOT back to ORIGINAL_UPLOAD_ROOT before printing the final summary / exiting so post-commands still write to the expected root.
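A minimal sketch of the suggested save-and-restore pattern follows. The loop body, `run_suite`, and the suite names are placeholders rather than the contents of `WasmBatchRunner.sh`; only `HELIX_WORKITEM_UPLOAD_ROOT` and `ORIGINAL_UPLOAD_ROOT` come from the comment above. Whether Helix post-commands actually observe the variable depends on how the script is invoked, which this sketch does not model.

```bash
#!/usr/bin/env bash
# Fall back to a temp directory so the sketch also runs outside Helix.
ORIGINAL_UPLOAD_ROOT="${HELIX_WORKITEM_UPLOAD_ROOT:-$(mktemp -d)}"

run_suite() {
    # Placeholder for extracting a suite ZIP and invoking its test runner.
    echo "ran $1" > "$HELIX_WORKITEM_UPLOAD_ROOT/result.txt"
}

for suite in SuiteA SuiteB; do
    # Point each suite at its own upload subdirectory to keep results isolated.
    export HELIX_WORKITEM_UPLOAD_ROOT="$ORIGINAL_UPLOAD_ROOT/$suite"
    mkdir -p "$HELIX_WORKITEM_UPLOAD_ROOT"
    run_suite "$suite"
done

# Restore the root so anything that runs after the loop sees the original location.
export HELIX_WORKITEM_UPLOAD_ROOT="$ORIGINAL_UPLOAD_ROOT"
echo "Upload root restored to: $HELIX_WORKITEM_UPLOAD_ROOT"
```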
Note
This review was generated by Copilot (Claude Opus 4.6 + GPT-5.4 multi-model review).
🤖 Copilot Code Review: PR #126157
Holistic Assessment
Motivation: Reducing Helix queue pressure from 172 → ~23 work items is a meaningful infrastructure improvement. The problem is real (queue slot contention slows CI for everyone), and the 87% reduction is substantial.
Approach: Greedy bin-packing by file size is a reasonable heuristic for balancing batch runtimes. The
Summary:
Detailed Findings
Note
This PR description was AI/Copilot-generated.
Summary
Reduce Helix queue pressure by grouping ~172 individual WASM CoreCLR library test work items into ~23 batched work items (87% queue pressure reduction).
Changes
- `eng/testing/WasmBatchRunner.sh` (new): Batch runner script that extracts and runs multiple test suites sequentially within a single Helix work item, with per-suite result isolation via separate `HELIX_WORKITEM_UPLOAD_ROOT` directories.
- `src/libraries/sendtohelix-browser.targets` (modified):
  - `WasmBatchLibraryTests` property (defaults `true` for CoreCLR+Chrome, `false` otherwise)
  - `_GroupWorkItems` inline MSBuild task: greedy bin-packing by file size, large suites (>50MB) stay solo
  - `_ComputeBatchTimeout` inline task: 2 min/suite timeout, 10 min minimum
  - `_AddBatchedWorkItemsForLibraryTests` target: creates balanced batched work items
  - `WasmBatchLibraryTests != true`
Expected Impact
The primary benefit is queue pressure reduction: 149 fewer items competing for Helix machines, which helps during queue saturation periods. Machine time savings are modest (~6%) because per-suite Chrome/WASM startup overhead is not eliminated by batching.
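The greedy bin-packing by file size listed under Changes can be sketched roughly as follows. This is one common variant (largest first, each archive goes to the currently lightest batch) written in bash for illustration; the PR's `_GroupWorkItems` task is C# and may differ in detail. The function name `group_work_items`, the fixed batch count, and the sizes in the example are all assumptions.

```bash
#!/usr/bin/env bash
# Hypothetical sketch. Reads "<size_mb> <name>" lines on stdin and
# prints "<batch_id> <name>" lines.
group_work_items() {
    local batch_count=$1 solo_threshold_mb=$2
    local -a load=()
    local i size name lightest next_solo=$batch_count
    for (( i = 0; i < batch_count; i++ )); do load[i]=0; done

    # Largest first, so big archives are placed before small ones fill the gaps.
    sort -k1,1nr | while read -r size name; do
        if (( size > solo_threshold_mb )); then
            # Large suites get a batch of their own.
            echo "$next_solo $name"
            next_solo=$(( next_solo + 1 ))
            continue
        fi
        # Pick the batch with the smallest total size so far.
        lightest=0
        for (( i = 1; i < batch_count; i++ )); do
            if (( load[i] < load[lightest] )); then lightest=$i; fi
        done
        load[lightest]=$(( load[lightest] + size ))
        echo "$lightest $name"
    done
}

# Made-up sizes in MB; two shared batches, 50 MB solo threshold.
group_work_items 2 50 <<'EOF'
60 System.Security.Cryptography
30 System.IO.Compression
20 System.Runtime
15 System.Collections
10 System.Linq
5 Microsoft.Bcl.Memory
EOF
```

With these inputs the 60 MB archive lands alone in batch 2, and the remaining five are split so that batches 0 and 1 each total 40 MB. File size is only a proxy for runtime, so balanced sizes do not guarantee balanced durations.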
Opt-out
Disable with `/p:WasmBatchLibraryTests=false` to fall back to individual work items.
Future Work