test: fix canaries-v2 #5932
Merged
Under pytest-xdist (-n 120) each worker created its own private hub, exhausting the per-account hub limit (100) and triggering destructive cross-worker cleanup that deleted hubs other workers were actively using, causing "Hub ... does not exist" failures. The add_model_references fixture also swallowed all errors and did not wait for async reference propagation, causing "Hub content ... does not exist" failures.

- Share a single hub across all xdist workers via filelock + a JSON state file with reference counting; only the last worker tears it down.
- Make _cleanup_old_hubs non-destructive: only delete hubs older than STALE_HUB_AGE_HOURS and never the active run's hub.
- Add add_model_references_to_hub helper that creates references idempotently (keyed by hub + model set) and polls until each reference is resolvable before tests run.
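The "poll until each reference is resolvable" step can be sketched roughly as below. This is a minimal illustration, not the helper from this PR: `wait_until_resolvable`, its parameters, and the `is_resolvable` callback are hypothetical names, and the real helper presumably calls a SageMaker describe API where this sketch takes a callback.

```python
import time
from typing import Callable, Iterable


def wait_until_resolvable(
    names: Iterable[str],
    is_resolvable: Callable[[str], bool],  # hypothetical probe, e.g. wraps a describe call
    timeout_s: float = 300.0,              # assumed default, not taken from the PR
    interval_s: float = 5.0,               # assumed default, not taken from the PR
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> None:
    """Block until every name resolves, or raise TimeoutError."""
    pending = set(names)
    deadline = clock() + timeout_s
    while True:
        # Keep only the references that are still not visible.
        pending = {name for name in pending if not is_resolvable(name)}
        if not pending:
            return
        if clock() >= deadline:
            # Fail loudly instead of letting tests hit "does not exist" later.
            raise TimeoutError(f"references never resolved: {sorted(pending)}")
        sleep(interval_s)
```

Injecting `sleep` and `clock` keeps the loop testable without real waiting. Raising on timeout, rather than swallowing the error as the old fixture did, means a propagation problem surfaces at setup instead of inside an unrelated test.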
…ngs pollution

ModelBuilder mutates session.settings._local_download_dir to a temporary /tmp/sagemaker/model-builder/<uuid> path. The serve integ tests passed the repo-wide session-scoped sagemaker_session fixture into ModelBuilder, so that mutation leaked across test modules. After the temp dir was cleaned up, the lingering setting broke unrelated tests sharing the same session, notably tests/integ/sagemaker/workflow/test_tuning_steps.py::test_tuning_multi_algos with "ValueError: Inputted directory ... does not exist".

Override sagemaker_session in tests/integ/sagemaker/serve/conftest.py with a dedicated session (constructed identically to the parent fixture) so the ModelBuilder mutation stays contained within the serve package.
The previous reference-counted teardown in the session fixture finalizer was unsafe: pytest-xdist distributes tests dynamically, so a worker could finish its session (running finalizers) while other workers still had hub tests pending. Decrementing to zero there deleted the shared hub mid-run, causing "Hub ... does not exist" / "Hub content ... does not exist" failures in gated hub tests.

Workers now only create-or-reuse the shared hub (never delete it). Teardown runs exactly once in pytest_sessionfinish on the controller process (no workerinput), which is guaranteed to run after all workers finish. Stale hub reclamation continues to be handled by the age-based _cleanup_old_hubs.
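A controller-only teardown of the kind described above might look like the following sketch. It relies on the pytest-xdist convention that only worker processes carry a `workerinput` attribute on their config. `teardown_shared_resources` and `TEARDOWN_LOG` are hypothetical stand-ins for the real cleanup, which is not shown in this conversation.

```python
TEARDOWN_LOG = []  # stand-in so the sketch has an observable effect


def teardown_shared_resources() -> None:
    """Hypothetical placeholder for the real shared-resource cleanup."""
    TEARDOWN_LOG.append("teardown")


def is_xdist_controller(config) -> bool:
    # pytest-xdist sets `workerinput` on worker configs only; the controller
    # process (and a plain non-xdist run) does not have the attribute.
    return not hasattr(config, "workerinput")


def pytest_sessionfinish(session, exitstatus):
    # Workers return immediately, so a worker that finishes early can never
    # delete resources that other workers are still using.
    if not is_xdist_controller(session.config):
        return
    teardown_shared_resources()
```

Placed in a `conftest.py`, the hook runs in every process, but only the controller (or a non-xdist run) performs the cleanup.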
…ut in integ tests

Two unrelated v2 integ-test failures, fixed together:

- test_spark_processing.py::test_sagemaker_pyspark_v3 (Spark 3.x): build_jar ran javac/jar without checking exit codes, so a failed jar rebuild (which truncates the committed hello-spark-java.jar) was swallowed and surfaced later as a misleading "code ... wasn't found" error, especially under xdist where the fixture runs per worker. Run the build commands with explicit return-code checks and assert the jar exists afterward.
- test_serve_model_builder_inference_component_happy.py::test_model_builder_ic_sagemaker_endpoint: deploying a 7B JumpStart model as an inference component on ml.g5.24xlarge regularly needs more than the 15-minute standard endpoint timeout to reach InService (the failure was a deploy timeout, not a quota cap). Add a dedicated 30-minute timeout (SERVE_SAGEMAKER_IC_ENDPOINT_TIMEOUT) for this flow without changing the standard serve endpoint timeout.
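The build_jar fix amounts to running each build command with an explicit exit-code check and then verifying the artifact. A minimal sketch follows. `run_checked` and `build_artifact` are hypothetical names, not the functions in the test suite, and the non-empty check is an extra assumption beyond the "assert the jar exists" described above.

```python
import subprocess
from pathlib import Path
from typing import Optional, Sequence


def run_checked(cmd: Sequence[str], cwd: Optional[Path] = None) -> None:
    """Run one build command and raise if it exits non-zero."""
    result = subprocess.run(list(cmd), cwd=cwd, capture_output=True, text=True)
    if result.returncode != 0:
        # Surface the real failure now instead of a misleading error later.
        raise RuntimeError(
            f"{cmd[0]} failed with exit code {result.returncode}:\n{result.stderr}"
        )


def build_artifact(commands: Sequence[Sequence[str]], artifact: Path) -> Path:
    """Run the build commands in order, then verify the artifact was produced."""
    for cmd in commands:
        run_checked(cmd)
    if not artifact.is_file():
        raise RuntimeError(f"build finished but {artifact} does not exist")
    # Assumption beyond the commit message: also reject a truncated (empty) file.
    if artifact.stat().st_size == 0:
        raise RuntimeError(f"build finished but {artifact} is empty")
    return artifact
```

Passing `check=True` to `subprocess.run` would also raise on failure. The explicit check is used here only so that the error message can include the captured stderr.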
…st-2#logsV2:log-groups/log-group/$252Faws$252Fcodebuild$252Fsagemaker-python-sdk-ci-integ-tests/log-events/e558697a-488d-4eab-a4ad-2971d9a1081f
…y test

JumpStart hub: The shared hub was being deleted at session end on the controller, but hub tests deploy long-lived endpoints, so a straggler worker could still be running a hub test at ~100% when teardown deleted the hub, causing intermittent "Hub ... does not exist" failures (e.g. test_jumpstart_hub_gated_estimator_with_eula). Stop deleting the hub during the run entirely: session-end teardown still cleans leaked endpoints/models/configs/artifacts but no longer deletes the hub, and stale hubs from prior runs are reclaimed proactively at setup via the age-based _cleanup_old_hubs (older than STALE_HUB_AGE_HOURS).

Inference-component serve test: test_model_builder_ic_sagemaker_endpoint fails in the ModelBuilder IC deploy path: CreateEndpoint is followed by a DescribeEndpoint that intermittently reports the endpoint as not found. This is an SDK-level issue, not a test config problem, so xfail (non-strict) the test to unblock the canary while it is tracked separately.

X-AI-Prompt: Stop mid-run hub deletion (rely on age-based reclamation) and xfail the flaky ModelBuilder inference-component deploy test
X-AI-Tool: kiro-cli
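The age-based reclamation rule (delete only hubs older than STALE_HUB_AGE_HOURS, never the active run's hub) can be expressed as a pure selection step. The sketch below is illustrative only: `select_stale_hubs` is a hypothetical name, the value 24 is an assumption because the commit does not state the threshold, and the `HubName` / `CreationTime` keys assume hub summaries shaped like a SageMaker hub listing.

```python
from datetime import datetime, timedelta, timezone
from typing import Iterable, List, Mapping, Optional

STALE_HUB_AGE_HOURS = 24  # assumed value; the real constant's value is not given above


def select_stale_hubs(
    hubs: Iterable[Mapping],
    active_hub_name: str,
    now: Optional[datetime] = None,
    max_age_hours: float = STALE_HUB_AGE_HOURS,
) -> List[str]:
    """Return names of hubs that are safe to delete under the age-based rule."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(hours=max_age_hours)
    return [
        hub["HubName"]
        for hub in hubs
        # Never touch the hub this run is using, and leave recent hubs alone
        # because another concurrent run may still depend on them.
        if hub["HubName"] != active_hub_name and hub["CreationTime"] < cutoff
    ]
```

Keeping the selection separate from the delete calls makes the safety rule easy to unit-test without touching any account resources.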
These canaries only need to exercise the train/deploy/predict flow, not
produce a well-trained model, yet they dominated canary runtime (the
estimator tests each ran ~100 min). Trim the training workload to bring
the suite under one hour while keeping coverage intact.
Bert estimator tests (full QNLI -> QNLI-tiny + epochs=1):
- map the floating "*" version of huggingface-spc-bert-base-cased to the
QNLI-tiny dataset instead of the full QNLI dataset (constants.py)
- cap training to a single epoch (hyperparameters={"epochs": "1"}) for:
  - test_jumpstart_estimator
  - test_jumpstart_hub_estimator
  - test_jumpstart_hub_estimator_with_session
Gated llama estimator tests (sec_amazon has no tiny variant, so cap steps
via hyperparameters={"max_steps": "1"}):
- test_gated_model_training_v1
- test_gated_model_training_v2
- test_jumpstart_hub_gated_estimator_with_eula
X-AI-Prompt: Reduce JumpStart estimator canary test runtime by using the tiny training dataset and capping epochs/steps so the suite finishes under an hour
X-AI-Tool: kiro-cli
Excludes test_gated_model_training_v2_neuron from ci-integ-tests and canaries-v2, which both filter out `slow_test`. Trn1/Inf2 capacity makes this test prone to multi-hour stalls, and max_steps=1 cannot shrink the provisioning wait.
aviruthen
approved these changes
Jun 10, 2026
Summary
This PR fixes a batch of `canaries-v2` integration-test failures and trims the runtime of the slowest JumpStart estimator canaries. After this change, a full `canaries-v2` run completes in 53m 54s (545 passed, 162 skipped, 2 xfailed), where it previously exceeded the three-hour budget.

The failures fell into a few buckets:

- … `Hub ... does not exist` / `Hub content ... does not exist`).
- `ModelBuilder` mutated a shared session-scoped `sagemaker_session`, leaking a temp download dir into unrelated workflow tests.

Changes by area
1. JumpStart private hub xdist-safety

- … `_cleanup_old_hubs` (`STALE_HUB_AGE_HOURS`).
- `add_model_references_to_hub`: idempotently create model references (keyed by hub + model set) and poll until each reference resolves before tests use it.

2. Serve session isolation

- … `sagemaker_session` in `tests/integ/sagemaker/serve/conftest.py` with a dedicated session so `ModelBuilder`'s `_local_download_dir` mutation stays contained and no longer breaks unrelated tests sharing the repo-wide session.

3. Stability fixes

- … `javac`/`jar` with explicit return-code checks and assert the jar exists, so a failed rebuild fails loudly instead of surfacing later as a misleading "code not found" error.
- … (`SERVE_SAGEMAKER_IC_ENDPOINT_TIMEOUT`) for the 7B-model IC deploy flow.
- … `x_fail_if_ice` to also treat `InsufficientInstanceCapacity` (not just `CapacityError`) as an expected, transient failure.

4. Runtime reduction
Test modifications

| Test | File | Change |
| --- | --- | --- |
| `test_jumpstart_estimator` | `jumpstart/estimator/test_jumpstart_estimator.py` | … `"*"` mapping) + `hyperparameters={"epochs": "1"}` |
| `test_jumpstart_hub_estimator` | `jumpstart/private_hub/estimator/test_jumpstart_private_hub_estimator.py` | `epochs=1` |
| `test_jumpstart_hub_estimator_with_session` | `jumpstart/private_hub/estimator/test_jumpstart_private_hub_estimator.py` | `epochs=1` |
| `test_gated_model_training_v1` | `jumpstart/estimator/test_jumpstart_estimator.py` | `hyperparameters={"max_steps": "1"}` |
| `test_gated_model_training_v2` | `jumpstart/estimator/test_jumpstart_estimator.py` | `hyperparameters={"max_steps": "1"}` |
| `test_jumpstart_hub_gated_estimator_with_eula` | `jumpstart/private_hub/estimator/test_jumpstart_private_hub_estimator.py` | `hyperparameters={"max_steps": "1"}` |
| `test_jumpstart_gated_model_inference_component_enabled` | `jumpstart/model/test_jumpstart_model.py` | `@x_fail_if_ice` |
| `test_model_builder_ic_sagemaker_endpoint` | `serve/test_serve_model_builder_inference_component_happy.py` | `xfail` (non-strict) + dedicated 30-min IC endpoint timeout |
| `test_sagemaker_pyspark_v3` | `test_spark_processing.py` | `build_jar` now checks `javac`/`jar` exit codes and asserts jar exists |
| … | `jumpstart/private_hub/model/test_jumpstart_private_hub_model.py` | `add_model_references` |

Testing
Validated with a full `canaries-v2` run: 545 passed, 162 skipped, 2 xfailed in 3234.32s (0:53:54). The previously failing hub/serve/spark tests pass and the estimator canaries no longer dominate the suite.