test: fix canaries-v2 by lucasjia-aws · Pull Request #5932 · aws/sagemaker-python-sdk

lucasjia-aws · 2026-06-05T22:51:16Z

Summary

This PR fixes a batch of canaries-v2 integration-test failures and trims the runtime of the slowest JumpStart estimator canaries. After this change a full canaries-v2 run completes in 53m 54s (545 passed, 162 skipped, 2 xfailed); earlier runs exceeded the three-hour budget.

The failures fell into a few buckets:

JumpStart hub tests were not xdist-safe — under high parallelism each worker created its own private hub, exhausting the per-account hub limit (100) and triggering cross-worker cleanup that deleted hubs other workers were still using (Hub ... does not exist / Hub content ... does not exist).
Session/settings pollution across packages — ModelBuilder mutated a shared session-scoped sagemaker_session, leaking a temp download dir into unrelated workflow tests.
Flaky / under-provisioned deploy paths — inference-component endpoints needed a longer timeout, Spark jar builds swallowed failures, and transient capacity shortages surfaced as red canaries.
Excessively long training canaries — the bert estimator tests each ran ~100 min because they trained on the full dataset, even though they only need to exercise the train/deploy/predict flow.

Changes by area

1. JumpStart private hub xdist-safety

Share a single private hub across all xdist workers via a filelock + JSON state file instead of one hub per worker.
Stop deleting the shared hub mid-run. Teardown cleans leaked endpoints/models/configs/artifacts but no longer deletes the hub; stale hubs from prior runs are reclaimed by an age-based _cleanup_old_hubs (STALE_HUB_AGE_HOURS).
Add add_model_references_to_hub: idempotently create model references (keyed by hub + model set) and poll until each reference resolves before tests use it.
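
The share-one-hub pattern above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the PR's actual fixture: it assumes a directory visible to every xdist worker (with pytest-xdist this is commonly `tmp_path_factory.getbasetemp().parent`), and it swaps in the standard library's `fcntl.flock` (Unix only) for the `filelock` package so the example has no third-party dependencies. The names `get_or_create_shared_hub`, `shared_hub.json`, and the `create_hub` callback are hypothetical.

```python
import fcntl
import json
import os
from contextlib import contextmanager


@contextmanager
def _exclusive_lock(lock_path):
    """Hold an exclusive advisory lock on lock_path (stdlib stand-in for filelock)."""
    with open(lock_path, "w") as handle:
        fcntl.flock(handle, fcntl.LOCK_EX)
        try:
            yield
        finally:
            fcntl.flock(handle, fcntl.LOCK_UN)


def get_or_create_shared_hub(state_dir, create_hub):
    """Return the hub name shared by all workers, calling create_hub at most once."""
    state_path = os.path.join(state_dir, "shared_hub.json")
    with _exclusive_lock(state_path + ".lock"):
        # A worker that arrives later finds the state file and reuses the hub.
        if os.path.exists(state_path):
            with open(state_path) as state_file:
                return json.load(state_file)["hub_name"]
        # The first worker to take the lock creates the hub and records it.
        hub_name = create_hub()
        with open(state_path, "w") as state_file:
            json.dump({"hub_name": hub_name}, state_file)
        return hub_name
```

Because the state file is only read or written while the lock is held, two workers cannot both conclude that no hub exists yet.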

2. Serve session isolation

Override sagemaker_session in tests/integ/sagemaker/serve/conftest.py with a dedicated session so ModelBuilder's _local_download_dir mutation stays contained and no longer breaks unrelated tests sharing the repo-wide session.

3. Stability fixes

Spark: run javac/jar with explicit return-code checks and assert the jar exists, so a failed rebuild fails loudly instead of surfacing later as a misleading "code not found" error.
Inference component: add a dedicated 30-min endpoint timeout (SERVE_SAGEMAKER_IC_ENDPOINT_TIMEOUT) for the 7B-model IC deploy flow.
Broaden x_fail_if_ice to also treat InsufficientInstanceCapacity (not just CapacityError) as an expected, transient failure.
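
As a rough illustration of the broadened capacity check, the sketch below matches both error names in the exception text. It is not the SDK's `x_fail_if_ice` implementation, which is not shown in this PR description; `is_capacity_error`, `expect_capacity_failures`, and `ExpectedCapacityFailure` are hypothetical names, and a pytest-based decorator would presumably call `pytest.xfail` where this sketch raises.

```python
import functools

# Error names treated as transient capacity shortages, per the change above.
CAPACITY_ERROR_MARKERS = ("CapacityError", "InsufficientInstanceCapacity")


class ExpectedCapacityFailure(Exception):
    """Hypothetical signal; a pytest decorator would call pytest.xfail instead."""


def is_capacity_error(exc):
    """Return True if the exception text names a known capacity shortage."""
    return any(marker in str(exc) for marker in CAPACITY_ERROR_MARKERS)


def expect_capacity_failures(func):
    """Convert capacity shortages into ExpectedCapacityFailure; re-raise the rest."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except Exception as exc:
            if is_capacity_error(exc):
                raise ExpectedCapacityFailure(str(exc)) from exc
            raise  # unrelated failures must still fail the test
    return wrapper
```

Matching on a tuple of markers keeps the list of tolerated errors in one place, so a further capacity-related error name can be added without touching the decorator.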

4. Runtime reduction

Trim training workload on the slowest estimator canaries (details + measured numbers below).

Test modifications

Test	File	Change	Before	After
`test_jumpstart_estimator`	`jumpstart/estimator/test_jumpstart_estimator.py`	Dataset full QNLI → QNLI-tiny (via `"*"` mapping) + `hyperparameters={"epochs": "1"}`	6049s	665s
`test_jumpstart_hub_estimator`	`jumpstart/private_hub/estimator/test_jumpstart_private_hub_estimator.py`	Same dataset change + `epochs=1`	6064s	663s
`test_jumpstart_hub_estimator_with_session`	`jumpstart/private_hub/estimator/test_jumpstart_private_hub_estimator.py`	Same dataset change + `epochs=1`	5940s	695s
`test_gated_model_training_v1`	`jumpstart/estimator/test_jumpstart_estimator.py`	`hyperparameters={"max_steps": "1"}`	2010s	1892s
`test_gated_model_training_v2`	`jumpstart/estimator/test_jumpstart_estimator.py`	`hyperparameters={"max_steps": "1"}`	1682s	1591s
`test_jumpstart_hub_gated_estimator_with_eula`	`jumpstart/private_hub/estimator/test_jumpstart_private_hub_estimator.py`	`hyperparameters={"max_steps": "1"}`	2544s	2231s
`test_jumpstart_gated_model_inference_component_enabled`	`jumpstart/model/test_jumpstart_model.py`	Decorated with `@x_fail_if_ice`	—	—
`test_model_builder_ic_sagemaker_endpoint`	`serve/test_serve_model_builder_inference_component_happy.py`	`xfail` (non-strict) + dedicated 30-min IC endpoint timeout	—	—
`test_sagemaker_pyspark_v3`	`test_spark_processing.py`	`build_jar` now checks `javac`/`jar` exit codes and asserts jar exists	—	641s
private hub model tests	`jumpstart/private_hub/model/test_jumpstart_private_hub_model.py`	Use shared-hub fixture + `add_model_references`	—	—
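
The arithmetic behind the Before/After columns can be checked with a short script. The timings are copied from the table above; `summarize` is a hypothetical helper name. Note that these are per-test durations summed across tests that run in parallel, so the totals are not wall-clock time for the suite.

```python
# (test name, seconds before, seconds after), copied from the table above.
TIMINGS = [
    ("test_jumpstart_estimator", 6049, 665),
    ("test_jumpstart_hub_estimator", 6064, 663),
    ("test_jumpstart_hub_estimator_with_session", 5940, 695),
    ("test_gated_model_training_v1", 2010, 1892),
    ("test_gated_model_training_v2", 1682, 1591),
    ("test_jumpstart_hub_gated_estimator_with_eula", 2544, 2231),
]


def summarize(timings):
    """Return (total seconds before, total seconds after, per-test speedup factor)."""
    total_before = sum(before for _, before, _ in timings)
    total_after = sum(after for _, _, after in timings)
    speedups = {name: round(before / after, 2) for name, before, after in timings}
    return total_before, total_after, speedups


total_before, total_after, speedups = summarize(TIMINGS)
print(total_before, total_after)
for name, factor in speedups.items():
    print(f"{name}: {factor}x")
```

The summed per-test time falls from 24289 s to 7737 s. Nearly all of the saving comes from the three bert tests; the gated tests change only slightly, which suggests their runtime is mostly spent outside the training steps.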

Testing

Validated with a full canaries-v2 run: 545 passed, 162 skipped, 2 xfailed in 3234.32s (0:53:54). The previously failing hub/serve/spark tests pass and the estimator canaries no longer dominate the suite.

Under pytest-xdist (-n 120) each worker created its own private hub, exhausting the per-account hub limit (100) and triggering destructive cross-worker cleanup that deleted hubs other workers were actively using, causing "Hub ... does not exist" failures. The add_model_references fixture also swallowed all errors and did not wait for async reference propagation, causing "Hub content ... does not exist" failures.

- Share a single hub across all xdist workers via filelock + a JSON state file with reference counting; only the last worker tears it down.
- Make _cleanup_old_hubs non-destructive: only delete hubs older than STALE_HUB_AGE_HOURS and never the active run's hub.
- Add add_model_references_to_hub helper that creates references idempotently (keyed by hub + model set) and polls until each reference is resolvable before tests run.
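
The "poll until each reference is resolvable" step might look roughly like the sketch below. It is an illustration only: `wait_until`, `wait_for_references`, and the `is_resolvable` callback are hypothetical names, and the default timeout and interval are assumed values, not the ones used in the PR.

```python
import time


def wait_until(check, timeout_s=300.0, interval_s=5.0,
               sleep=time.sleep, clock=time.monotonic):
    """Poll check() until it returns True or timeout_s elapses; return the outcome."""
    deadline = clock() + timeout_s
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)


def wait_for_references(model_ids, is_resolvable, **poll_kwargs):
    """Return the model ids that never became resolvable (empty list on success)."""
    unresolved = []
    for model_id in model_ids:
        # Bind model_id as a default argument so each check targets its own id.
        if not wait_until(lambda m=model_id: is_resolvable(m), **poll_kwargs):
            unresolved.append(model_id)
    return unresolved
```

Returning the unresolved ids, rather than swallowing the failure, lets the caller fail the fixture with a message that names exactly which references did not propagate.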

…ngs pollution

ModelBuilder mutates session.settings._local_download_dir to a temporary /tmp/sagemaker/model-builder/<uuid> path. The serve integ tests passed the repo-wide session-scoped sagemaker_session fixture into ModelBuilder, so that mutation leaked across test modules. After the temp dir was cleaned up, the lingering setting broke unrelated tests sharing the same session, notably tests/integ/sagemaker/workflow/test_tuning_steps.py::test_tuning_multi_algos with "ValueError: Inputted directory ... does not exist".

Override sagemaker_session in tests/integ/sagemaker/serve/conftest.py with a dedicated session (constructed identically to the parent fixture) so the ModelBuilder mutation stays contained within the serve package.
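
The leak and its containment can be reproduced without AWS using stand-in objects. Everything here is hypothetical scaffolding: `make_session` and `builder_like_mutation` only mimic the shape of a session whose `settings._local_download_dir` is overwritten, and the path is an example value.

```python
from types import SimpleNamespace


def make_session():
    """Hypothetical stand-in for constructing a session with default settings."""
    return SimpleNamespace(settings=SimpleNamespace(_local_download_dir=None))


def builder_like_mutation(session):
    """Mimic the side effect described above: point the session at a temp dir."""
    session.settings._local_download_dir = "/tmp/sagemaker/model-builder/example"


# Before: serve tests and workflow tests share one session object.
shared = make_session()
builder_like_mutation(shared)
leaked = shared.settings._local_download_dir  # a later workflow test sees this

# After: the serve package builds its own session, so the repo-wide one is untouched.
repo_wide = make_session()
serve_only = make_session()
builder_like_mutation(serve_only)
contained = repo_wide.settings._local_download_dir

print(leaked, contained)
```

In the first case the temp path survives on the shared object after the serve test ends; in the second, the repo-wide session keeps its default value.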

The previous reference-counted teardown in the session fixture finalizer was unsafe: pytest-xdist distributes tests dynamically, so a worker could finish its session (running finalizers) while other workers still had hub tests pending. Decrementing to zero there deleted the shared hub mid-run, causing "Hub ... does not exist" / "Hub content ... does not exist" failures in gated hub tests. Workers now only create-or-reuse the shared hub (never delete it). Teardown runs exactly once in pytest_sessionfinish on the controller process (no workerinput), which is guaranteed to run after all workers finish. Stale hub reclamation continues to be handled by the age-based _cleanup_old_hubs.
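
A controller-only teardown hook could be sketched as below. pytest-xdist sets a `workerinput` attribute on the config object in worker processes, so its absence identifies the controller (or a run without xdist). `cleanup_leaked_resources` and `CLEANUP_LOG` are hypothetical placeholders, and the `SimpleNamespace` objects at the end only simulate the pytest session for demonstration.

```python
from types import SimpleNamespace

CLEANUP_LOG = []  # records cleanup runs so the sketch is observable


def cleanup_leaked_resources():
    """Hypothetical placeholder for deleting leaked endpoints, models, and configs."""
    CLEANUP_LOG.append("cleaned")


def is_xdist_worker(config):
    """Return True in xdist worker processes, where config.workerinput is set."""
    return hasattr(config, "workerinput")


def pytest_sessionfinish(session, exitstatus):
    """Run teardown once, on the controller, after the workers have finished."""
    if is_xdist_worker(session.config):
        return
    cleanup_leaked_resources()


# Simulated sessions: one worker, then the controller.
worker = SimpleNamespace(config=SimpleNamespace(workerinput={"workerid": "gw0"}))
controller = SimpleNamespace(config=SimpleNamespace())
pytest_sessionfinish(worker, 0)
pytest_sessionfinish(controller, 0)
```

In a real suite the hook would live in the top-level conftest.py, where pytest discovers it by name.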

…ut in integ tests

Two unrelated v2 integ-test failures, fixed together:

- test_spark_processing.py::test_sagemaker_pyspark_v3 (Spark 3.x): build_jar ran javac/jar without checking exit codes, so a failed jar rebuild (which truncates the committed hello-spark-java.jar) was swallowed and surfaced later as a misleading "code ... wasn't found" error, especially under xdist where the fixture runs per worker. Run the build commands with explicit return-code checks and assert the jar exists afterward.
- test_serve_model_builder_inference_component_happy.py::test_model_builder_ic_sagemaker_endpoint: deploying a 7B JumpStart model as an inference component on ml.g5.24xlarge regularly needs more than the 15-minute standard endpoint timeout to reach InService (the failure was a deploy timeout, not a quota cap). Add a dedicated 30-minute timeout (SERVE_SAGEMAKER_IC_ENDPOINT_TIMEOUT) for this flow without changing the standard serve endpoint timeout.
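
For the Spark fix, a build step that fails loudly might be sketched as follows. This is not the PR's `build_jar`; `run_checked` and `build_artifact` are hypothetical names, and the empty-file check is an addition of this sketch, motivated by the truncation described above, beyond the existence assertion the PR mentions.

```python
import os
import subprocess


def run_checked(cmd, cwd=None):
    """Run a command and raise with its stderr if it exits non-zero."""
    result = subprocess.run(cmd, cwd=cwd, capture_output=True, text=True)
    if result.returncode != 0:
        raise RuntimeError(
            f"{cmd[0]} failed with exit code {result.returncode}:\n{result.stderr}"
        )
    return result


def build_artifact(commands, artifact_path, cwd=None):
    """Run each build command in order, then verify the output file exists."""
    for cmd in commands:
        run_checked(cmd, cwd=cwd)
    # Sketch-only extra: also reject an empty file left behind by a failed rebuild.
    if not os.path.isfile(artifact_path) or os.path.getsize(artifact_path) == 0:
        raise RuntimeError(f"expected build output is missing or empty: {artifact_path}")
    return artifact_path
```

A caller would pass the `javac` and `jar` invocations as the `commands` list; any non-zero exit then stops the fixture at the build step instead of at job submission.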

…st-2#logsV2:log-groups/log-group/$252Faws$252Fcodebuild$252Fsagemaker-python-sdk-ci-integ-tests/log-events/e558697a-488d-4eab-a4ad-2971d9a1081f

…y test

JumpStart hub: The shared hub was being deleted at session end on the controller, but hub tests deploy long-lived endpoints, so a straggler worker could still be running a hub test at ~100% when teardown deleted the hub, causing intermittent "Hub ... does not exist" failures (e.g. test_jumpstart_hub_gated_estimator_with_eula). Stop deleting the hub during the run entirely: session-end teardown still cleans leaked endpoints/models/configs/artifacts but no longer deletes the hub, and stale hubs from prior runs are reclaimed proactively at setup via the age-based _cleanup_old_hubs (older than STALE_HUB_AGE_HOURS).

Inference-component serve test: test_model_builder_ic_sagemaker_endpoint fails in the ModelBuilder IC deploy path: CreateEndpoint is followed by a DescribeEndpoint that intermittently reports the endpoint as not found. This is an SDK-level issue, not a test config problem, so xfail (non-strict) the test to unblock the canary while it is tracked separately.

X-AI-Prompt: Stop mid-run hub deletion (rely on age-based reclamation) and xfail the flaky ModelBuilder inference-component deploy test
X-AI-Tool: kiro-cli
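
The age-based reclamation rule (only hubs older than STALE_HUB_AGE_HOURS, never the active run's hub) reduces to a small filter. The sketch below is an assumption-laden illustration, not `_cleanup_old_hubs` itself: the 24-hour threshold is a guess, `select_stale_hubs` is a hypothetical name, and the hub records are plain dictionaries with `HubName` and timezone-aware `CreationTime` keys.

```python
from datetime import datetime, timedelta, timezone

STALE_HUB_AGE_HOURS = 24  # assumed value; the PR text does not state the threshold


def select_stale_hubs(hubs, active_hub_name, now=None,
                      max_age_hours=STALE_HUB_AGE_HOURS):
    """Return names of hubs older than the cutoff, excluding the active run's hub."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(hours=max_age_hours)
    return [
        hub["HubName"]
        for hub in hubs
        if hub["HubName"] != active_hub_name and hub["CreationTime"] < cutoff
    ]
```

Keeping the selection separate from the delete calls makes the rule easy to unit-test with a fixed `now`, and a hub created by a concurrent run is spared because it is younger than the cutoff.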

These canaries only need to exercise the train/deploy/predict flow, not produce a well-trained model, yet they dominated canary runtime (the estimator tests each ran ~100 min). Trim the training workload to bring the suite under one hour while keeping coverage intact.

Bert estimator tests (full QNLI -> QNLI-tiny + epochs=1):
- map the floating "*" version of huggingface-spc-bert-base-cased to the QNLI-tiny dataset instead of the full QNLI dataset (constants.py)
- cap training to a single epoch (hyperparameters={"epochs": "1"}) for:
  - test_jumpstart_estimator
  - test_jumpstart_hub_estimator
  - test_jumpstart_hub_estimator_with_session

Gated llama estimator tests (sec_amazon has no tiny variant, so cap steps via hyperparameters={"max_steps": "1"}):
- test_gated_model_training_v1
- test_gated_model_training_v2
- test_jumpstart_hub_gated_estimator_with_eula

X-AI-Prompt: Reduce JumpStart estimator canary test runtime by using the tiny training dataset and capping epochs/steps so the suite finishes under an hour
X-AI-Tool: kiro-cli

Excludes test_gated_model_training_v2_neuron from ci-integ-tests and canaries-v2, which both filter out `slow_test`. Trn1/Inf2 capacity makes this test prone to multi-hour stalls, and max_steps=1 cannot shrink the provisioning wait.

papriwal

LGTM!

lucasjia-aws added 2 commits June 5, 2026 15:35

lucasjia-aws requested a review from a team as a code owner June 5, 2026 22:51

lucasjia-aws requested a review from zhaoqizqwang June 5, 2026 22:51

lucasjia-aws temporarily deployed to auto-approve June 5, 2026 22:51 — with GitHub Actions Inactive

lucasjia-aws changed the title ~~fix: fix canaries-v2~~ test: fix canaries-v2 Jun 5, 2026

lucasjia-aws temporarily deployed to auto-approve June 6, 2026 06:28 — with GitHub Actions Inactive

lucasjia-aws temporarily deployed to auto-approve June 6, 2026 06:47 — with GitHub Actions Inactive

https://us-west-2.console.aws.amazon.com/cloudwatch/home?region=us-we…

d912e41

…st-2#logsV2:log-groups/log-group/$252Faws$252Fcodebuild$252Fsagemaker-python-sdk-ci-integ-tests/log-events/e558697a-488d-4eab-a4ad-2971d9a1081f

lucasjia-aws temporarily deployed to auto-approve June 7, 2026 03:31 — with GitHub Actions Inactive

lucasjia-aws temporarily deployed to auto-approve June 7, 2026 08:00 — with GitHub Actions Inactive

lucasjia-aws temporarily deployed to auto-approve June 9, 2026 19:17 — with GitHub Actions Inactive

lucasjia-aws temporarily deployed to auto-approve June 10, 2026 07:18 — with GitHub Actions Inactive

lucasjia-aws temporarily deployed to auto-approve June 10, 2026 21:14 — with GitHub Actions Inactive

aviruthen approved these changes Jun 10, 2026

lucasjia-aws removed the request for review from zhaoqizqwang June 10, 2026 23:41

papriwal approved these changes Jun 11, 2026

lucasjia-aws merged commit d0cbf41 into aws:master-v2 Jun 11, 2026
9 of 11 checks passed

lucasjia-aws deleted the fix/canary-v2 branch June 11, 2026 22:19

lucasjia-aws merged 8 commits into aws:master-v2 from lucasjia-aws:fix/canary-v2
