
test: fix canaries-v2 #5932

Merged
lucasjia-aws merged 8 commits into aws:master-v2 from lucasjia-aws:fix/canary-v2
Jun 11, 2026
@lucasjia-aws lucasjia-aws commented Jun 5, 2026

Collaborator

Summary

This PR fixes a batch of canaries-v2 integration-test failures and trims the runtime of the slowest JumpStart estimator canaries. After this change, a full canaries-v2 run completes in 53m 54s (545 passed, 162 skipped, 2 xfailed); previously the run exceeded the three-hour budget.

The failures fell into a few buckets:

  • JumpStart hub tests were not xdist-safe — under high parallelism each worker created its own private hub, exhausting the per-account hub limit (100) and triggering cross-worker cleanup that deleted hubs other workers were still using (Hub ... does not exist / Hub content ... does not exist).
  • Session/settings pollution across packages: ModelBuilder mutated a shared session-scoped sagemaker_session, leaking a temp download dir into unrelated workflow tests.
  • Flaky / under-provisioned deploy paths — inference-component endpoints needed a longer timeout, Spark jar builds swallowed failures, and transient capacity shortages surfaced as red canaries.
  • Excessively long training canaries — the bert estimator tests each ran ~100 min because they trained on the full dataset, even though they only need to exercise the train/deploy/predict flow.

Changes by area

1. JumpStart private hub xdist-safety

  • Share a single private hub across all xdist workers via a filelock + JSON state file instead of one hub per worker.
  • Stop deleting the shared hub mid-run. Teardown cleans leaked endpoints/models/configs/artifacts but no longer deletes the hub; stale hubs from prior runs are reclaimed by an age-based _cleanup_old_hubs (STALE_HUB_AGE_HOURS).
  • Add add_model_references_to_hub: idempotently create model references (keyed by hub + model set) and poll until each reference resolves before tests use it.
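
The create-or-reuse pattern behind the shared hub can be sketched as follows. This is a minimal illustration, not the PR's code: it uses an atomic lock file from the standard library as a stand-in for the filelock package, and `get_or_create_shared_hub`, `create_hub`, and the state-file names are hypothetical.

```python
import json
import os
import time
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def exclusive_lock(lock_path: Path, timeout: float = 30.0, poll: float = 0.05):
    """Stdlib stand-in for filelock.FileLock: hold an exclusive lock file."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # O_EXCL makes creation atomic: only one process can succeed.
            fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            break
        except FileExistsError:
            if time.monotonic() > deadline:
                raise TimeoutError(f"could not acquire {lock_path}")
            time.sleep(poll)
    try:
        yield
    finally:
        os.close(fd)
        os.unlink(lock_path)


def get_or_create_shared_hub(state_dir: Path, create_hub) -> str:
    """Return the run-wide hub name, creating the hub at most once.

    `create_hub` is a hypothetical callable that creates a hub and returns its name.
    """
    state_file = state_dir / "hub_state.json"
    with exclusive_lock(state_dir / "hub_state.lock"):
        if state_file.exists():
            # Another worker already created the hub; reuse it.
            return json.loads(state_file.read_text())["hub_name"]
        hub_name = create_hub()
        state_file.write_text(json.dumps({"hub_name": hub_name}))
        return hub_name
```

Every worker calls the same function with the same `state_dir`; whichever worker takes the lock first creates the hub, and the rest read its name from the state file.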

2. Serve session isolation

  • Override sagemaker_session in tests/integ/sagemaker/serve/conftest.py with a dedicated session so ModelBuilder's _local_download_dir mutation stays contained and no longer breaks unrelated tests sharing the repo-wide session.

3. Stability fixes

  • Spark: run javac/jar with explicit return-code checks and assert the jar exists, so a failed rebuild fails loudly instead of surfacing later as a misleading "code not found" error.
  • Inference component: add a dedicated 30-min endpoint timeout (SERVE_SAGEMAKER_IC_ENDPOINT_TIMEOUT) for the 7B-model IC deploy flow.
  • Broaden x_fail_if_ice to also treat InsufficientInstanceCapacity (not just CapacityError) as an expected, transient failure.
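
A rough sketch of what broadening the capacity check could look like. The two error names come from the description above; the decorator name, the string-matching approach, and `ExpectedTransientFailure` (a stand-in for calling `pytest.xfail` in a real suite) are assumptions made for illustration.

```python
import functools

# Error names treated as transient capacity shortages (taken from the PR text).
TRANSIENT_CAPACITY_MARKERS = ("CapacityError", "InsufficientInstanceCapacity")


class ExpectedTransientFailure(Exception):
    """Hypothetical stand-in for calling pytest.xfail() in a real test suite."""


def is_capacity_error(exc: BaseException) -> bool:
    """True if the exception text mentions any known capacity-shortage marker."""
    return any(marker in str(exc) for marker in TRANSIENT_CAPACITY_MARKERS)


def xfail_on_capacity_error(func):
    """Re-raise capacity shortages as expected failures; propagate everything else."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except Exception as exc:
            if is_capacity_error(exc):
                raise ExpectedTransientFailure(str(exc)) from exc
            raise

    return wrapper
```

Keeping the markers in one tuple means a newly observed capacity error name can be added without touching the decorator.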

4. Runtime reduction

  • Trim training workload on the slowest estimator canaries (details + measured numbers below).

Test modifications

| Test | File | Change | Before | After |
|---|---|---|---|---|
| test_jumpstart_estimator | jumpstart/estimator/test_jumpstart_estimator.py | Dataset full QNLI → QNLI-tiny (via "*" mapping) + hyperparameters={"epochs": "1"} | 6049s | 665s |
| test_jumpstart_hub_estimator | jumpstart/private_hub/estimator/test_jumpstart_private_hub_estimator.py | Same dataset change + epochs=1 | 6064s | 663s |
| test_jumpstart_hub_estimator_with_session | jumpstart/private_hub/estimator/test_jumpstart_private_hub_estimator.py | Same dataset change + epochs=1 | 5940s | 695s |
| test_gated_model_training_v1 | jumpstart/estimator/test_jumpstart_estimator.py | hyperparameters={"max_steps": "1"} | 2010s | 1892s |
| test_gated_model_training_v2 | jumpstart/estimator/test_jumpstart_estimator.py | hyperparameters={"max_steps": "1"} | 1682s | 1591s |
| test_jumpstart_hub_gated_estimator_with_eula | jumpstart/private_hub/estimator/test_jumpstart_private_hub_estimator.py | hyperparameters={"max_steps": "1"} | 2544s | 2231s |
| test_jumpstart_gated_model_inference_component_enabled | jumpstart/model/test_jumpstart_model.py | Decorated with @x_fail_if_ice | | |
| test_model_builder_ic_sagemaker_endpoint | serve/test_serve_model_builder_inference_component_happy.py | xfail (non-strict) + dedicated 30-min IC endpoint timeout | | |
| test_sagemaker_pyspark_v3 | test_spark_processing.py | build_jar now checks javac/jar exit codes and asserts jar exists | | 641s |
| private hub model tests | jumpstart/private_hub/model/test_jumpstart_private_hub_model.py | Use shared-hub fixture + add_model_references | | |

Testing

Validated with a full canaries-v2 run: 545 passed, 162 skipped, 2 xfailed in 3234.32s (0:53:54). The previously failing hub/serve/spark tests pass and the estimator canaries no longer dominate the suite.

Under pytest-xdist (-n 120) each worker created its own private hub,
exhausting the per-account hub limit (100) and triggering destructive
cross-worker cleanup that deleted hubs other workers were actively
using, causing "Hub ... does not exist" failures. The add_model_references
fixture also swallowed all errors and did not wait for async reference
propagation, causing "Hub content ... does not exist" failures.

- Share a single hub across all xdist workers via filelock + a JSON
  state file with reference counting; only the last worker tears it down.
- Make _cleanup_old_hubs non-destructive: only delete hubs older than
  STALE_HUB_AGE_HOURS and never the active run's hub.
- Add add_model_references_to_hub helper that creates references
  idempotently (keyed by hub + model set) and polls until each
  reference is resolvable before tests run.
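
The "poll until each reference is resolvable" step might look roughly like this. It is a sketch under assumptions: `describe` is a hypothetical callable that raises `LookupError` while a reference is still propagating, and the timeout and interval defaults are illustrative, not values from the PR.

```python
import time


def wait_until_resolvable(describe, names, timeout=300.0, interval=5.0, sleep=time.sleep):
    """Poll `describe(name)` until it stops raising for every name.

    `describe` is a hypothetical callable that raises LookupError while a
    reference is still propagating and returns normally once it is visible.
    """
    deadline = time.monotonic() + timeout
    pending = list(names)
    while pending:
        still_pending = []
        for name in pending:
            try:
                describe(name)
            except LookupError:
                still_pending.append(name)
        pending = still_pending
        if pending:
            if time.monotonic() >= deadline:
                # Fail loudly instead of swallowing the error.
                raise TimeoutError(f"references never resolved: {pending}")
            sleep(interval)
```

Raising `TimeoutError` with the unresolved names keeps the failure at setup time, where it is easier to diagnose than a later "does not exist" error inside a test.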
…ngs pollution

ModelBuilder mutates session.settings._local_download_dir to a temporary
/tmp/sagemaker/model-builder/<uuid> path. The serve integ tests passed the
repo-wide session-scoped sagemaker_session fixture into ModelBuilder, so that
mutation leaked across test modules. After the temp dir was cleaned up, the
lingering setting broke unrelated tests sharing the same session, notably
tests/integ/sagemaker/workflow/test_tuning_steps.py::test_tuning_multi_algos
with "ValueError: Inputted directory ... does not exist".

Override sagemaker_session in tests/integ/sagemaker/serve/conftest.py with a
dedicated session (constructed identically to the parent fixture) so the
ModelBuilder mutation stays contained within the serve package.
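
The leak is easy to reproduce in miniature. The classes below are hypothetical, heavily simplified stand-ins for the SDK's session and settings objects; they only illustrate why a dedicated session contains the mutation.

```python
import shutil
import tempfile
from dataclasses import dataclass, field
from pathlib import Path
from typing import Optional


@dataclass
class Settings:
    local_download_dir: Optional[str] = None


@dataclass
class Session:
    """Hypothetical, minimal stand-in for a SageMaker session object."""

    settings: Settings = field(default_factory=Settings)


def build_model(session: Session) -> None:
    """Mimic the described mutation: point the session at a temp dir, then delete it."""
    tmp = tempfile.mkdtemp(prefix="model-builder-")
    session.settings.local_download_dir = tmp
    shutil.rmtree(tmp)  # the setting now names a directory that no longer exists


def download_dir_is_valid(session: Session) -> bool:
    path = session.settings.local_download_dir
    return path is None or Path(path).is_dir()


# Shared session: the mutation leaks to every later user of the same object.
shared = Session()
build_model(shared)
leaked = not download_dir_is_valid(shared)

# Dedicated session for the serve package: the repo-wide session stays clean.
repo_wide, serve_only = Session(), Session()
build_model(serve_only)
isolated = download_dir_is_valid(repo_wide)
```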
@lucasjia-aws lucasjia-aws requested a review from a team as a code owner June 5, 2026 22:51
@lucasjia-aws lucasjia-aws requested a review from zhaoqizqwang June 5, 2026 22:51
@lucasjia-aws lucasjia-aws changed the title fix: fix canaries-v2 test: fix canaries-v2 Jun 5, 2026
The previous reference-counted teardown in the session fixture finalizer
was unsafe: pytest-xdist distributes tests dynamically, so a worker could
finish its session (running finalizers) while other workers still had hub
tests pending. Decrementing to zero there deleted the shared hub mid-run,
causing "Hub ... does not exist" / "Hub content ... does not exist"
failures in gated hub tests.

Workers now only create-or-reuse the shared hub (never delete it). Teardown
runs exactly once in pytest_sessionfinish on the controller process (no
workerinput), which is guaranteed to run after all workers finish. Stale
hub reclamation continues to be handled by the age-based _cleanup_old_hubs.
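
A minimal sketch of controller-only teardown, assuming the hook lives in a conftest.py. pytest-xdist sets a `workerinput` attribute on the config object of worker processes only, which is what the check relies on; `cleanup_shared_resources` and `TEARDOWN_CALLS` are hypothetical placeholders for the real teardown.

```python
TEARDOWN_CALLS = []  # records invocations so the behaviour is observable


def is_xdist_worker(config) -> bool:
    """pytest-xdist sets `workerinput` on the config of worker processes only."""
    return hasattr(config, "workerinput")


def cleanup_shared_resources() -> None:
    """Hypothetical placeholder for the run-wide teardown logic."""
    TEARDOWN_CALLS.append("teardown")


def pytest_sessionfinish(session, exitstatus):
    """pytest hook: run shared teardown once, on the controller process."""
    if is_xdist_worker(session.config):
        return  # workers never tear down shared state
    cleanup_shared_resources()
```

Without xdist there is no `workerinput` either, so the same hook still runs teardown in a plain single-process pytest run.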
…ut in integ tests

Two unrelated v2 integ-test failures, fixed together:

- test_spark_processing.py::test_sagemaker_pyspark_v3 (Spark 3.x): build_jar
  ran javac/jar without checking exit codes, so a failed jar rebuild (which
  truncates the committed hello-spark-java.jar) was swallowed and surfaced
  later as a misleading "code ... wasn't found" error, especially under xdist
  where the fixture runs per worker. Run the build commands with explicit
  return-code checks and assert the jar exists afterward.

- test_serve_model_builder_inference_component_happy.py::
  test_model_builder_ic_sagemaker_endpoint: deploying a 7B JumpStart model as
  an inference component on ml.g5.24xlarge regularly needs more than the
  15-minute standard endpoint timeout to reach InService (the failure was a
  deploy timeout, not a quota cap). Add a dedicated 30-minute timeout
  (SERVE_SAGEMAKER_IC_ENDPOINT_TIMEOUT) for this flow without changing the
  standard serve endpoint timeout.
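
The Spark fix amounts to making every build command fail loudly. One way to sketch that is shown below; the function names are hypothetical, and the non-empty check goes slightly beyond the "assert the jar exists" described above, as a guard against the truncation case.

```python
import subprocess
from pathlib import Path
from typing import Sequence


def run_checked(cmd: Sequence[str]) -> None:
    """Run a build command and raise with its stderr if it exits non-zero."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode != 0:
        raise RuntimeError(
            f"{cmd[0]} failed with exit code {result.returncode}: {result.stderr.strip()}"
        )


def build_artifact(commands: Sequence[Sequence[str]], artifact: Path) -> Path:
    """Run each command in order, then verify the artifact exists and is non-empty."""
    for cmd in commands:
        run_checked(cmd)
    # Assumption: an empty file is treated as a failed build (truncated jar).
    if not artifact.is_file() or artifact.stat().st_size == 0:
        raise FileNotFoundError(f"build produced no usable artifact at {artifact}")
    return artifact
```

In the test fixture the command list would hold the javac and jar invocations, so a compiler error stops the fixture immediately instead of surfacing later inside the Spark job.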
…st-2#logsV2:log-groups/log-group/$252Faws$252Fcodebuild$252Fsagemaker-python-sdk-ci-integ-tests/log-events/e558697a-488d-4eab-a4ad-2971d9a1081f
…y test

JumpStart hub:
The shared hub was being deleted at session end on the controller, but hub
tests deploy long-lived endpoints, so a straggler worker could still be running
a hub test at ~100% when teardown deleted the hub, causing intermittent
"Hub ... does not exist" failures (e.g. test_jumpstart_hub_gated_estimator_
with_eula). Stop deleting the hub during the run entirely: session-end teardown
still cleans leaked endpoints/models/configs/artifacts but no longer deletes the
hub, and stale hubs from prior runs are reclaimed proactively at setup via the
age-based _cleanup_old_hubs (older than STALE_HUB_AGE_HOURS).
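
The age-based selection can be sketched as a pure function. The 24-hour value below is an assumption (the PR names STALE_HUB_AGE_HOURS but does not state its setting), and `select_stale_hubs` with its `(name, creation_time)` input shape is hypothetical.

```python
from datetime import datetime, timedelta, timezone

STALE_HUB_AGE_HOURS = 24  # assumed value; the PR names the constant but not its setting


def select_stale_hubs(hubs, active_hub_name, now=None, max_age_hours=STALE_HUB_AGE_HOURS):
    """Return names of hubs older than the cutoff, never including the active hub.

    `hubs` is an iterable of (name, creation_time) pairs with timezone-aware datetimes.
    """
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(hours=max_age_hours)
    return [
        name
        for name, created in hubs
        if name != active_hub_name and created < cutoff
    ]
```

Separating selection from deletion keeps the destructive call in one place and makes the age rule testable without touching any AWS resource.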

Inference-component serve test:
test_model_builder_ic_sagemaker_endpoint fails in the ModelBuilder IC deploy
path: CreateEndpoint is followed by a DescribeEndpoint that intermittently
reports the endpoint as not found. This is an SDK-level issue, not a test
config problem, so xfail (non-strict) the test to unblock the canary while it
is tracked separately.

X-AI-Prompt: Stop mid-run hub deletion (rely on age-based reclamation) and xfail the flaky ModelBuilder inference-component deploy test
X-AI-Tool: kiro-cli
These canaries only need to exercise the train/deploy/predict flow, not
produce a well-trained model, yet they dominated canary runtime (the
estimator tests each ran ~100 min). Trim the training workload to bring
the suite under one hour while keeping coverage intact.

Bert estimator tests (full QNLI -> QNLI-tiny + epochs=1):
- map the floating "*" version of huggingface-spc-bert-base-cased to the
  QNLI-tiny dataset instead of the full QNLI dataset (constants.py)
- cap training to a single epoch (hyperparameters={"epochs": "1"}) for:
    - test_jumpstart_estimator
    - test_jumpstart_hub_estimator
    - test_jumpstart_hub_estimator_with_session

Gated llama estimator tests (sec_amazon has no tiny variant, so cap steps
via hyperparameters={"max_steps": "1"}):
- test_gated_model_training_v1
- test_gated_model_training_v2
- test_jumpstart_hub_gated_estimator_with_eula
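
As a rough illustration of the two trimming strategies, the sketch below pairs a floating "*" version lookup with the capped hyperparameters. The mapping name, the dataset key, and both helper functions are hypothetical; only the model ID and the hyperparameter values come from the text above.

```python
MODEL_ID = "huggingface-spc-bert-base-cased"

# Hypothetical mapping; the dataset key is a placeholder, not the repository's real path.
TRAINING_DATASET_BY_MODEL_VERSION = {
    (MODEL_ID, "*"): "placeholder-prefix/QNLI-tiny/",
}


def resolve_training_dataset(model_id: str, version: str = "*") -> str:
    """Prefer an exact (model, version) entry, else fall back to the floating "*" entry."""
    table = TRAINING_DATASET_BY_MODEL_VERSION
    return table.get((model_id, version)) or table[(model_id, "*")]


def canary_hyperparameters(has_tiny_dataset: bool) -> dict:
    """One epoch when a tiny dataset exists; otherwise cap the run at one step."""
    return {"epochs": "1"} if has_tiny_dataset else {"max_steps": "1"}
```

The values are strings because that is how the hyperparameters are written in the PR.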

X-AI-Prompt: Reduce JumpStart estimator canary test runtime by using the tiny training dataset and capping epochs/steps so the suite finishes under an hour
X-AI-Tool: kiro-cli
Excludes test_gated_model_training_v2_neuron from ci-integ-tests and
canaries-v2, which both filter out `slow_test`. Trn1/Inf2 capacity makes
this test prone to multi-hour stalls, and max_steps=1 cannot shrink the
provisioning wait.
@lucasjia-aws lucasjia-aws removed the request for review from zhaoqizqwang June 10, 2026 23:41

@papriwal papriwal left a comment

Collaborator

LGTM!

@lucasjia-aws lucasjia-aws merged commit d0cbf41 into aws:master-v2 Jun 11, 2026
9 of 11 checks passed
@lucasjia-aws lucasjia-aws deleted the fix/canary-v2 branch June 11, 2026 22:19