[NO-REVIEW] Batch WASM CoreCLR library test suites on Helix#126157

Draft
radekdoulik wants to merge 3 commits into main from batch-wasm-coreclr-library-tests

@radekdoulik

Note

This PR description was AI/Copilot-generated.

Summary

Reduce Helix queue pressure by grouping ~172 individual WASM CoreCLR library test work items into ~23 batched work items (87% queue pressure reduction).

Changes

  • eng/testing/WasmBatchRunner.sh (new): Batch runner script that extracts and runs multiple test suites sequentially within a single Helix work item, with per-suite result isolation via separate HELIX_WORKITEM_UPLOAD_ROOT directories.
  • src/libraries/sendtohelix-browser.targets (modified):
    • WasmBatchLibraryTests property (defaults true for CoreCLR+Chrome, false otherwise)
    • _GroupWorkItems inline MSBuild task: greedy bin-packing by file size, large suites (>50MB) stay solo
    • _ComputeBatchTimeout inline task: 2 min/suite timeout, 10 min minimum
    • _AddBatchedWorkItemsForLibraryTests target: creates balanced batched work items
    • Sample apps excluded from batching, kept as individual work items
    • Original target gated on WasmBatchLibraryTests != true

Expected Impact

| Metric | Before | After | Change |
|---|---|---|---|
| Work items | 172 | ~23 | -87% |
| Machine time | 437m (7.3h) | ~411m (6.8h) | -6% |
| Longest work item | 17m | ~18m | +1m |
| Queue slots used | 172 | 23 | -149 |

The primary benefit is queue pressure reduction — 149 fewer items competing for Helix machines, which helps during queue saturation periods. Machine time savings are modest (~6%) because per-suite Chrome/WASM startup overhead is not eliminated by batching.
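
The percentages in the table follow directly from the before/after figures. A minimal shell sketch of that arithmetic, assuming integer inputs; the `pct_reduction` helper is hypothetical and not part of this PR:

```shell
# Percentage reduction from `before` to `after`, to one decimal place.
# Integer-only arithmetic; assumes before > 0 and after <= before.
pct_reduction() {
  local before="$1"
  local after="$2"
  # Work in tenths of a percent, rounding to the nearest tenth.
  local tenths=$(( ((before - after) * 1000 + before / 2) / before ))
  printf '%d.%d\n' "$(( tenths / 10 ))" "$(( tenths % 10 ))"
}

pct_reduction 172 23    # work items: prints 86.6
pct_reduction 437 411   # machine minutes: prints 5.9
```

Rounded to whole percentages, these are the -87% and -6% figures quoted above.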

Opt-out

Disable with /p:WasmBatchLibraryTests=false to fall back to individual work items.

Future Work

  • Reuse Chrome instance across test suites within a batch (requires xharness changes, would save additional ~80-160m)
  • Parallel xharness invocations within batches

@radekdoulik added this to the Future milestone Mar 26, 2026
@radekdoulik added the NO-REVIEW (Experimental/testing PR, do NOT review it) and area-Infrastructure-coreclr labels Mar 26, 2026
Copilot AI review requested due to automatic review settings March 26, 2026 16:51

Copilot AI left a comment

Pull request overview

This PR aims to reduce Helix queue pressure for WASM CoreCLR library testing by batching many individual test-suite work items into a smaller number of larger work items, using a batch runner to execute multiple suites sequentially with isolated result uploads.

Changes:

  • Add a WASM batch runner script to unzip and run multiple test suites sequentially inside one Helix work item.
  • Extend sendtohelix-browser.targets to optionally generate batched Helix work items via an MSBuild bin-packing step and per-batch timeout computation.
  • Adjust browser/CoreCLR Helix and xharness timeouts, and update the browser/CoreCLR test exclusion list.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.

| File | Description |
|---|---|
| src/libraries/tests.proj | Updates the browser/CoreCLR disabled-test list (significant exclusion removals). |
| src/libraries/sendtohelixhelp.proj | Increases default Helix work item timeout for browser/CoreCLR. |
| src/libraries/sendtohelix-browser.targets | Adds batching mode, grouping/timeout tasks, and a new target to emit batched Helix work items. |
| eng/testing/tests.wasm.targets | Increases xharness timeout default for CoreCLR WASM test runs. |
| eng/testing/WasmBatchRunner.sh | New script to run multiple suite zips in one work item with per-suite upload directories. |

radekdoulik and others added 2 commits March 26, 2026 18:06
Reduce Helix queue pressure by grouping ~172 individual WASM CoreCLR
library test work items into ~23 batched work items (87% reduction).

Changes:
- Add eng/testing/WasmBatchRunner.sh: batch runner that extracts and
  runs multiple test suites sequentially within a single work item,
  with per-suite result isolation
- Add greedy bin-packing inline MSBuild task (_GroupWorkItems) that
  distributes test archives into balanced batches by file size
- Add _AddBatchedWorkItemsForLibraryTests target gated on
  WasmBatchLibraryTests property (defaults true for CoreCLR+Chrome)
- Sample apps excluded from batching, kept as individual work items
- Can be disabled with /p:WasmBatchLibraryTests=false

Expected impact:
- 172 → ~23 Helix work items (87% queue pressure reduction)
- ~6% machine time savings (~26 minutes)
- Longest batch ~18 minutes (well-balanced bin-packing)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Remove unused EXECUTION_DIR variable from WasmBatchRunner.sh
- Use PayloadArchive (ZIP) instead of PayloadDirectory to pass
  sendtohelixhelp.proj validation
- Use HelixCommand with RunTests.sh→WasmBatchRunner.sh substitution
  to preserve env var setup and pre-commands

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@radekdoulik force-pushed the batch-wasm-coreclr-library-tests branch from cf67805 to a751466 March 26, 2026 17:09
Batch--1 (1 item) and Batch-5 (8 items) timed out in CI because
the 2min/suite formula was too aggressive. System.IO.Compression
alone takes 11m, System.Security.Cryptography takes 17m, and
Microsoft.Bcl.Memory takes 6m. With 19/21 batches passing and
the longest at 17m24s, a 30m minimum provides adequate headroom.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings March 26, 2026 22:14

Copilot AI left a comment

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

Comment on lines +293 to +301
// 20 minutes per suite to account for WASM startup overhead + test execution;
// minimum 30 minutes to handle the heaviest individual suites (e.g. Cryptography ~17m)
int totalMinutes = Math.Max(30, count * 20);
var ts = TimeSpan.FromMinutes(totalMinutes);

var helixItem = new TaskItem(ItemPrefix + "Batch-" + bid);
helixItem.SetMetadata("BatchDir", BatchOutputDir + "batch-" + bid + "/");
helixItem.SetMetadata("Timeout", ts.ToString(@"hh\:mm\:ss"));
result.Add(helixItem);
Copilot AI Mar 26, 2026

The per-batch timeout logic is inconsistent with the comment immediately below and the PR description: this task currently computes totalMinutes = Math.Max(30, count * 20) (20 min/suite, 30 min minimum), but the target comment says "2 minutes per suite, minimum 10 minutes". Please align the code and comments (and ensure the resulting timeout is appropriate for the longest WASM suites) to avoid unexpectedly huge timeouts or unintended work item timeouts.
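
To make the gap between the two formulas concrete, here is a minimal shell sketch of the same max(minimum, count × per-suite) rule. The `compute_batch_timeout` function and its parameter order are assumptions for illustration; the actual task is inline C# using `TimeSpan`:

```shell
# Prints a per-batch timeout as hh:mm:ss.
# $1: number of suites in the batch
# $2: minutes allowed per suite
# $3: minimum timeout in minutes
compute_batch_timeout() {
  local count="$1"
  local per_suite="$2"
  local minimum="$3"
  local total=$(( count * per_suite ))
  # Apply the floor so small batches still get enough headroom.
  if [ "$total" -lt "$minimum" ]; then
    total=$minimum
  fi
  printf '%02d:%02d:00\n' "$(( total / 60 ))" "$(( total % 60 ))"
}

compute_batch_timeout 8 2 10    # documented formula: prints 00:16:00
compute_batch_timeout 8 20 30   # formula in the code: prints 02:40:00
compute_batch_timeout 1 20 30   # floor applies:       prints 00:30:00
```

For an 8-suite batch the two formulas differ by a factor of ten, which is why the comment and the code should agree.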

Comment on lines +454 to +469
<!-- Stage each batch: copy ZIPs and the runner script into a per-batch directory -->
<MakeDir Directories="$(IntermediateOutputPath)helix-batches/batch-%(_WasmUniqueBatchId.Identity)/" />
<Copy SourceFiles="@(_WasmGroupedItem)" DestinationFolder="$(IntermediateOutputPath)helix-batches/batch-%(BatchId)/" />
<Copy SourceFiles="$(RepositoryEngineeringDir)testing/WasmBatchRunner.sh"
DestinationFolder="$(IntermediateOutputPath)helix-batches/batch-%(_WasmUniqueBatchId.Identity)/" />

<!-- Compute per-batch timeout: 2 minutes per suite, minimum 10 minutes -->
<_ComputeBatchTimeout GroupedItems="@(_WasmGroupedItem)" BatchIds="@(_WasmUniqueBatchId)"
ItemPrefix="$(WorkItemPrefix)" BatchOutputDir="$(IntermediateOutputPath)helix-batches/">
<Output TaskParameter="TimedItems" ItemName="_WasmTimedBatchItem" />
</_ComputeBatchTimeout>

<!-- Create ZIP archives from batch directories (sendtohelixhelp.proj requires PayloadArchive) -->
<ZipDirectory SourceDirectory="%(_WasmTimedBatchItem.BatchDir)"
DestinationFile="$(IntermediateOutputPath)helix-batches/%(_WasmTimedBatchItem.Identity).zip"
Overwrite="true" />
Copilot AI Mar 26, 2026

The batch staging directory under $(IntermediateOutputPath)helix-batches/ is only created/copied into, never cleaned. On incremental builds, stale ZIPs from a previous run can remain in batch-* directories and get re-zipped into the payload, causing unexpected extra suites to run. Consider deleting $(IntermediateOutputPath)helix-batches/ (or each batch-* directory) before copying, or otherwise ensuring the batch directories are empty before zipping.
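
As an illustration only, the "empty before copying" behavior can be sketched in shell. The `stage_batch` function is hypothetical; the real staging happens in MSBuild via MakeDir/Copy/ZipDirectory:

```shell
# Stage a batch directory from scratch so archives left over from an
# earlier run cannot leak into the payload.
# $1: batch directory; remaining arguments: suite archives (at least one)
stage_batch() {
  local batch_dir="$1"
  shift
  rm -rf -- "$batch_dir"      # drop anything a previous run left behind
  mkdir -p -- "$batch_dir"
  cp -- "$@" "$batch_dir/"
}
```

Without the `rm -rf` step, an archive that belonged to this batch in an earlier grouping would still be present when the directory is zipped.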

Comment on lines +80 to +87
echo ""
echo "Total: $SUITE_COUNT | Passed: $((SUITE_COUNT - FAIL_COUNT)) | Failed: $FAIL_COUNT"

if [[ $FAIL_COUNT -ne 0 ]]; then
    exit 1
fi

exit 0
Copilot AI Mar 26, 2026

This script leaves HELIX_WORKITEM_UPLOAD_ROOT set to the last suite’s subdirectory when it exits. Helix post-commands (e.g., the CoreCLR dump-doc generation in sendtohelixhelp.proj) may run after the main command and use HELIX_WORKITEM_UPLOAD_ROOT to decide where to write artifacts; consider restoring HELIX_WORKITEM_UPLOAD_ROOT back to ORIGINAL_UPLOAD_ROOT before printing the final summary / exiting so post-commands still write to the expected root.
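
A minimal sketch of the save/restore pattern, assuming the loop shape described in this PR. `run_batch` and the `run_suite` stub are hypothetical stand-ins, not the actual contents of WasmBatchRunner.sh:

```shell
# Hypothetical stand-in for extracting and running one suite; it only
# records where the suite would have uploaded its results.
run_suite() {
  echo "results for $1" > "$HELIX_WORKITEM_UPLOAD_ROOT/results.txt"
}

# Run each named suite with its own upload subdirectory, then put the
# original upload root back before returning.
run_batch() {
  local original_root="$HELIX_WORKITEM_UPLOAD_ROOT"
  local fail_count=0
  local suite
  for suite in "$@"; do
    export HELIX_WORKITEM_UPLOAD_ROOT="$original_root/$suite"
    mkdir -p "$HELIX_WORKITEM_UPLOAD_ROOT"
    run_suite "$suite" || fail_count=$(( fail_count + 1 ))
  done
  # Restore so anything that runs afterwards sees the work item's root.
  export HELIX_WORKITEM_UPLOAD_ROOT="$original_root"
  [ "$fail_count" -eq 0 ]
}
```

The restore happens unconditionally after the loop, so it applies whether or not any suite failed.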

@github-actions

Note

This review was generated by Copilot (Claude Opus 4.6 + GPT-5.4 multi-model review).

🤖 Copilot Code Review — PR #126157

Holistic Assessment

Motivation: Reducing Helix queue pressure from 172 → ~23 work items is a meaningful infrastructure improvement. The problem is real — queue slot contention slows CI for everyone — and the 87% reduction is substantial.

Approach: Greedy bin-packing by file size is a reasonable heuristic for balancing batch runtimes. The WasmBatchRunner.sh wrapper with per-suite upload root isolation preserves test result granularity. Gating behind WasmBatchLibraryTests with an opt-out escape hatch is good practice. The HelixCommand.Replace('./RunTests.sh', ...) approach preserves env var setup and dev-certs from the original command chain, which addresses a key concern from earlier review feedback.

Summary: ⚠️ Needs Human Review. The overall design is sound and the implementation addresses earlier review feedback (PayloadArchive validation, HelixCommand preservation). However, there is a stale comment that actively misleads about timeout values, and a few robustness concerns around incremental builds and disk usage that a human reviewer should weigh. No blocking correctness bugs found, but the issues below merit attention.


Detailed Findings

⚠️ Stale Comment — Timeout values in XML comment don't match code

src/libraries/sendtohelix-browser.targets:460:

<!-- Compute per-batch timeout: 2 minutes per suite, minimum 10 minutes -->

The actual code in _ComputeBatchTimeout (lines 293-295) computes Math.Max(30, count * 20), that is, 20 minutes per suite with a 30-minute minimum. The comment went stale when commit face27d2 ("Fix batch timeout: 30m min, 20m/suite for WASM overhead") updated the code but not the comment.

This is actively misleading for anyone tuning timeouts. The comment should read:

<!-- Compute per-batch timeout: 20 minutes per suite, minimum 30 minutes -->

(Flagged by both Claude and GPT-5.4)


⚠️ Stale Batch Staging Directory — Incremental build risk

src/libraries/sendtohelix-browser.targets:455-469: The batch staging directory ($(IntermediateOutputPath)helix-batches/) is never cleaned before MakeDir/Copy/ZipDirectory. If the target is rerun in the same intermediate output path with a different grouping (e.g., a test archive was added or removed), stale ZIPs from previous runs could remain and get repackaged into batches.

Consider adding a RemoveDir before MakeDir:

<RemoveDir Directories="$(IntermediateOutputPath)helix-batches/" />

This is a low-probability issue in CI (fresh builds), but could cause confusing failures during local iteration.

(Flagged by GPT-5.4)


💡 Dead Conditions in Batched Path — V8 and Firefox branches

src/libraries/sendtohelix-browser.targets:417,423: The batched target _AddBatchedWorkItemsForLibraryTests only runs when WasmBatchLibraryTests == 'true', which is only defaulted to true for RuntimeFlavor == 'CoreCLR' and Scenario == 'WasmTestOnChrome' (lines 44-45). Therefore:

  • The Firefox condition on _WasmBatchWorkItem (line 417) can never be true
  • The V8 condition on _WasmBatchSampleZip (line 423) can never be true

These are harmless but imply unsupported scenarios are handled. Consider removing them or adding a comment that they are placeholders for future expansion.

(Flagged by both Claude and GPT-5.4)


💡 Unused System.Linq Import

src/libraries/sendtohelix-browser.targets:192: The _GroupWorkItems task declares <Using Namespace="System.Linq" /> but the C# code fragment doesn't use any LINQ methods. The Distinct() call is done in MSBuild item transforms (line 451), not in the C# task code. This can be removed.


💡 No Disk Cleanup Between Suites in Batch Runner

eng/testing/WasmBatchRunner.sh:33-34: Each suite is extracted to $BATCH_DIR/$suiteName but never cleaned up after execution. With 8-10 suites per batch, extracted test archives accumulate on disk. This is likely fine for Helix machines with adequate disk space, but worth noting: if suites are large, this could become an issue. An rm -rf "$suiteDir" after popd would free disk space between suites.
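
A minimal sketch of that cleanup, assuming each extracted suite directory contains its own RunTests.sh and that results have already been written to the upload root before the directory is removed. The `run_extracted_suite` function is hypothetical, and it uses a subshell `cd` in place of pushd/popd:

```shell
# Run one already-extracted suite, then delete its directory.
# $1: suite directory; remaining arguments are forwarded to RunTests.sh
run_extracted_suite() {
  local suite_dir="$1"
  shift
  local rc=0
  # Subshell keeps the caller's working directory untouched.
  ( cd "$suite_dir" && ./RunTests.sh "$@" ) || rc=$?
  # Free disk space whether the suite passed or failed.
  rm -rf -- "$suite_dir"
  return "$rc"
}
```

The suite's exit code is captured before the directory is removed, so failures are still reported to the caller.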


✅ HelixCommand Preservation — Previous feedback addressed

The HelixCommand.Replace('./RunTests.sh', 'chmod +x WasmBatchRunner.sh && ./WasmBatchRunner.sh') approach (line 473) correctly preserves the full command chain: env var exports, dotnet dev-certs https, and any pre-commands from HelixPreCommand. Since IncludeHelixCorrelationPayload is false for browser targets (line 57), there's no --runtime-path suffix to worry about. The $@ passthrough in WasmBatchRunner.sh (line 43) correctly forwards any arguments to each suite's RunTests.sh.
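
The effect of the substitution can be sketched in shell with `sed`. The `swap_runner` function and the sample command string are made up for illustration; the real replacement is an MSBuild property function applied to HelixCommand:

```shell
# Replace every "./RunTests.sh" in a command string with the batch
# runner invocation, leaving the rest of the command chain intact.
swap_runner() {
  # "&" is escaped in the replacement because sed would otherwise
  # expand it to the matched text.
  printf '%s\n' "$1" |
    sed 's|\./RunTests\.sh|chmod +x WasmBatchRunner.sh \&\& ./WasmBatchRunner.sh|g'
}

# Hypothetical command chain: env var setup, dev-certs, then the runner.
swap_runner 'export FOO=1 && dotnet dev-certs https && ./RunTests.sh --arg'
# prints: export FOO=1 && dotnet dev-certs https && chmod +x WasmBatchRunner.sh && ./WasmBatchRunner.sh --arg
```

Everything before and after the runner invocation survives unchanged, including trailing arguments.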


✅ PayloadArchive Validation — Previous feedback addressed

The batched path correctly creates ZIP archives via ZipDirectory (lines 467-469) and references them via PayloadArchive metadata (line 478), satisfying the sendtohelixhelp.proj validation requirement (lines 332-333 of that file).


✅ Bin-Packing Algorithm — Correct and well-structured

The greedy bin-packing in _GroupWorkItems is a sound approach: sort by size descending, then assign each item to the batch with the smallest total size so far. Large items (>50MB) are correctly isolated into their own batches with negative IDs. The _ComputeBatchTimeout task scales timeouts with the batch item count, subject to the 30-minute minimum.
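
The strategy described above can be sketched with `sort` and `awk`. This is an illustration under stated assumptions, not the inline C# task: the `group_work_items` function, the fixed batch count parameter, the 0-based batch numbering, and the unitless sizes are all hypothetical.

```shell
# Reads "size name" lines on stdin and prints "batch_id name" lines.
# $1: number of shared batches (hypothetical parameter)
# $2: solo threshold, in the same unit as the sizes
group_work_items() {
  local batch_count="$1"
  local solo_limit="$2"
  # Largest first, so big items are placed while batches are still empty.
  LC_ALL=C sort -k1,1nr |
    awk -v n="$batch_count" -v limit="$solo_limit" '
      BEGIN { for (b = 0; b < n; b++) load[b] = 0 }
      {
        size = $1; name = $2
        if (size > limit) {
          # Oversized archive: gets a batch to itself, numbered -1, -2, ...
          solo++
          id = -solo
          print id, name
          next
        }
        # Pick the batch with the smallest accumulated size so far.
        best = 0
        for (b = 1; b < n; b++)
          if (load[b] < load[best]) best = b
        load[best] += size
        print best, name
      }'
}

# Example: six archives, two shared batches, solo threshold of 50.
printf '%s\n' '10 d' '60 big' '25 b' '5 e' '30 a' '20 c' | group_work_items 2 50
```

In this example the two shared batches both end up with a total size of 45, and the oversized archive lands in batch -1. File size is only a proxy for runtime, so balanced sizes do not guarantee balanced durations.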

Generated by Code Review for issue #126157
