test: add nova tests by lucasjia-aws · Pull Request #5933 · aws/sagemaker-python-sdk

lucasjia-aws · 2026-06-08T08:17:56Z

Overview

This PR adds Nova model-customization deployment integration tests (SageMaker
endpoints and Amazon Bedrock custom models) and fixes a number of pre-existing,
unrelated integ-test failures surfaced once these tests started running in the
integ-tests-us-east-1 PR check.

Two source files in sagemaker-serve are modified. Both are genuine product
bugs in Nova code paths, justified in detail in Part 1 below — they are not
worked around in the tests because doing so would hide functionality the SDK
publicly claims to support.

Part 1 — New Nova tests

New integ test file: sagemaker-serve/tests/integ/test_nova_model_customization_deployment.py
(the Nova counterpart of test_model_customization_deployment.py). Tests are
marked us_east_1 so they run only in the us-east-1 (Nova) account via the
integ-tests-us-east-1 PR check.

New tests

Test class / test	What it covers
`TestModelCustomizationFromTrainingJob` (`test_build_from_training_job`, `test_deploy_from_training_job`, `test_fetch_endpoint_names_for_base_model`)	Build a Nova model from a TrainingJob, deploy it to a SageMaker endpoint, invoke it (Nova messages format), and fetch base-model endpoint names
`TestModelCustomizationFromModelPackage` (`test_build_from_model_package`, `test_deploy_from_model_package`)	Build/deploy via the registered model package and validate the endpoint
`TestInstanceTypeAutoDetection` (`test_instance_type_from_recipe`)	Nova requires an explicit supported instance type (`ml.g6.48xlarge`)
`TestModelCustomizationDetection` (`test_is_model_customization_training_job`, `test_is_model_customization_model_package`, `test_fetch_model_package_arn`)	Model-customization detection and model-package ARN resolution for Nova
`TestTrainerIntegration` (`test_sft_trainer_build`, `test_rlvr_trainer_build`)	`ModelBuilder` accepts an SFT/RLVR trainer object and builds the Nova model (DPO replaced with RLVR — Nova has no DPO recipe in `SageMakerPublicHub`)
`TestNovaBedrockDeployment` (`test_nova_bedrock_deployment_active`, `test_nova_bedrock_invoke`)	Deploy a fine-tuned Nova model to Amazon Bedrock as a custom model (`create_custom_model` + `create_custom_model_deployment`, polling to Active) and invoke it

The training_job_name fixture discovers the latest completed sft-nova-integ-*
job (produced every few hours by the scheduled Nova SFT workflow) whose output
model package still exists, rather than hardcoding a job name that goes stale
when resource cleanup deletes its model package.

Required source changes

File	Change	Why it is necessary (not a test/config issue)
`sagemaker-serve/src/sagemaker/serve/bedrock_model_builder.py`	`_get_checkpoint_uri_from_manifest` locates `manifest.json` from `output_data_config.s3_output_path` + training job name instead of `model_artifacts.s3_model_artifacts`	Nova fine-tuning jobs (`SFTTrainer`/`RLVRTrainer`/`DPOTrainer`) run serverless and never populate `model_artifacts` (no `model.tar.gz`; the field is `Unassigned`), so the old code raised `AttributeError: 'Unassigned' object has no attribute 's3_model_artifacts'` for any Nova job. This is corroborated by three independent sources that all locate the Nova manifest via `output_data_config`: (a) the sibling method `ModelBuilder._resolve_nova_escrow_uri`, (b) the official Nova Studio notebook `sm-studio-nova-training-job-sample-notebook.ipynb`, and (c) the real manifest path verified in the test account. `BedrockModelBuilder` was the only place using `model_artifacts` for Nova — an isolated inconsistency. It cannot be worked around in tests because no Nova job has `model_artifacts`.
`sagemaker-serve/src/sagemaker/serve/model_builder.py`	`_resolve_nova_escrow_uri` resolves the underlying TrainingJob via `_latest_training_job` for `BaseTrainer` instances	`ModelBuilder` publicly supports a trainer object as `model` (type is `Union[..., ModelTrainer, BaseTrainer, TrainingJob, ModelPackage, ...]`), and `SFTTrainer`/`RLVRTrainer`/`DPOTrainer` are `BaseTrainer` subclasses. The sibling methods `_is_model_customization` and `_fetch_model_package_arn` already handle `BaseTrainer`; only `_resolve_nova_escrow_uri` omitted it, so the same `ModelBuilder(model=trainer)` worked for detection/ARN-fetch but failed with "Nova escrow URI resolution requires a TrainingJob or ModelTrainer" on escrow resolution. This is an internal inconsistency; the fix aligns it with the other methods.

Part 2 — Fixes to other (pre-existing) test failures

These failures were already present in the suite and were surfaced/fixed while
bringing up the new tests; some were addressed across earlier iterations.

Area / file	Failing test(s)	Root cause	Fix
`sagemaker-serve/tests/integ/test_model_customization_deployment.py` (OSS)	`test_deploy_from_training_job` and the Bedrock import suite	Deployed endpoints/imported models were not verified or reliably cleaned up; import-job wait was unbounded	Added post-deploy invoke verification, bounded import-job wait with timeout/failure handling, age- and status-aware Bedrock cleanup, retrying invoke, and a yielding `deployed_model_arn` fixture that deletes the imported model
`test_nova_model_customization_deployment.py` (model-package & Bedrock paths)	`test_build/deploy_from_model_package`, `TestNovaBedrockDeployment`	Deploying Nova from a `ModelPackage` is unsupported (escrow artifacts are only resolvable from the TrainingJob manifest; the package is non-RMP)	Drove these via the supported TrainingJob path instead of the ModelPackage
`test_nova_model_customization_deployment.py` (instance type)	`test_instance_type_from_recipe`, `test_sft_trainer_build`, `test_rlvr_trainer_build`	`ModelBuilder` defaulted to `ml.m5.large`, which Nova rejects	Pass the supported `ml.g6.48xlarge`; assert it is used (Nova has no instance-type auto-detection)
`test_nova_model_customization_deployment.py` (capacity)	`test_deploy_from_training_job`, `test_deploy_from_model_package`	Transient region-wide `InsufficientInstanceCapacity` for `ml.g6.48xlarge` (not a quota or code issue)	Added `_deploy_or_skip_on_capacity` helper that skips (rather than fails) on capacity shortage
`sagemaker-mlops/tests/integ/test_feature_store_lakeformation.py`	`test_create_feature_group_and_enable_lake_formation`, `test_create_feature_group_with_lake_formation_enabled`, `test_enable_lake_formation_full_flow_with_policy_output`, `test_enable_lake_formation_default_logs_recommended_policy`	`enable_lake_formation` defaulted to `use_service_linked_role=True`, producing `RegisterResource` with both `WithFederation=True` and `UseServiceLinkedRole=True` — a combination Lake Formation rejects (`InvalidInputException: Unable to register the following path`); all existing registrations in the account used an explicit role	Register with `use_service_linked_role=False, registration_role_arn=role`, matching the supported explicit-role path
`sagemaker-mlops/tests/integ/test_feature_store_lakeformation.py`	`test_enable_lake_formation_fails_with_nonexistent_role`	Negative test asserted the error contains `EntityNotFoundException`, but under a least-privilege `iam:PassRole` policy the failure surfaces as `AccessDeniedException` on `iam:PassRole` before Lake Formation is reached	Accept `EntityNotFoundException`, `AccessDeniedException`, or `iam:PassRole` as valid "role not usable" outcomes
`sagemaker-mlops/tests/integ/test_feature_store.py`	`test_delete_feature_group`	Fixed 2s sleep then a single `get()`; feature-group deletion is asynchronous and stays describable while `Deleting`, causing intermittent "DID NOT RAISE"	Poll `get()` until it raises (group gone) or a 120s timeout

Note: the IAM permission gaps these tests also exposed (build-role iam:PassRole
to Bedrock/Lake Formation, bedrock:CreateCustomModel/custom-model-deployment,
and feature-store S3 bucket-policy/encryption permissions) are CI-infrastructure
changes handled separately in the SageMakerMLFPySDKInfraCDK package, not in
this repo.

Add post-deploy invoke verification and make the Bedrock import-job lifecycle robust in test_model_customization_deployment.py. - Verify deployed endpoints by invoking them and validating the response structure (LORA uses the adapter IC name, otherwise the default base IC). - Replace unconditional stop-all cleanup with age-based (>24h) and status-aware cleanup: stop only InProgress/Pending jobs and delete completed imported models, with logging on failures. - Add a class-scoped autouse cleanup_import_jobs fixture to replace the zzz-prefixed ordering hack. - Bound the import-job wait loop with a 60-minute timeout and fail fast on Failed status; fix importedModelName -> importedModelArn. - Delete the imported model after tests via a yielding deployed_model_arn fixture. - Configure bedrock-runtime with standard retries (10 attempts) and add a slow-marked, retrying test_bedrock_model_invoke to tolerate "model not ready" exceptions. X-AI-Prompt: Write commit message for the us-west-2 model customization deployment test hardening changes X-AI-Tool: kiro-cli

…eMaker) Add a Nova counterpart to test_model_customization_deployment.py covering ModelBuilder deployment of fine-tuned Nova models to SageMaker endpoints, running against the Nova test account in us-east-1 (784379639078). - TestModelCustomizationFromTrainingJob: build, deploy + invoke (Nova messages format), and fetch_endpoint_names_for_base_model. - TestModelCustomizationFromModelPackage: build and deploy from a registered model package. - TestInstanceTypeAutoDetection: instance type auto-detection from recipe. - TestModelCustomizationDetection: customization detection and model package ARN fetch. - TestTrainerIntegration: SFT and RLVR trainer build (DPO replaced with RLVR since Nova has no DPO recipe in SageMakerPublicHub). - Model package is resolved dynamically from the sdk-test-finetuned-models group (latest Completed), mirroring test_benchmark_evaluation_nova_model; dependent tests skip when none exists. - All tests marked us_east_1 so they run in the PR check integ-tests-us-east-1 job (intentionally not gpu_intensive, so they do not run in the scheduled GPU workflow). - Register gpu_intensive and us_east_1 markers in sagemaker-serve/tox.ini. The Bedrock deployment suite is kept commented out for now; the Nova for Bedrock integ tests will be added in a follow-up. X-AI-Prompt: Write commit message for the Nova-for-SageMaker model customization deployment integ tests and marker registration X-AI-Tool: kiro-cli

…g tests Add TestNovaBedrockDeployment covering deployment of a fine-tuned Nova model to Amazon Bedrock via BedrockModelBuilder, complementing the existing Nova-for-SageMaker tests in the same file. - Deploy a Nova model package through BedrockModelBuilder.deploy(), which routes Nova models to create_custom_model + create_custom_model_deployment and polls each resource to Active (vs the create_model_import_job path used for open-weight models). - test_nova_bedrock_deployment_active asserts the deployment reaches Active. - test_nova_bedrock_invoke (slow) invokes the deployed model end-to-end via bedrock-runtime, with standard retries to tolerate the cold-start window. - Model package is resolved dynamically from sdk-test-finetuned-models (latest Completed); deployment fixture cleans up the deployment and custom model afterwards. Role is resolved via get_execution_role(). - Marked us_east_1 (Nova test account, us-east-1) to run in the PR check integ-tests-us-east-1 job; not gpu_intensive. - Replace the previously commented-out OSS-style Bedrock suite (it used the import-job API, which does not apply to Nova) and update the module docstring to describe both SageMaker and Bedrock deployment targets. X-AI-Prompt: Write commit message for the Nova-for-Bedrock model customization deployment integ tests X-AI-Tool: kiro-cli

- Nova deploy/Bedrock tests: build from the TrainingJob instead of a ModelPackage, since Nova escrow artifacts are only resolvable from the training job's manifest (deploying from a ModelPackage is unsupported). - Lake Formation tests: register the S3 location with an explicit role (use_service_linked_role=False) to avoid the WithFederation+SLR combination that Lake Formation rejects.

The training_job_name fixture hardcoded a reusable job whose output model package (sdk-test-nova-finetuned-models/1) was deleted, so every test that resolves the job's output model package failed with "ModelPackage ... does not exist". Discover the latest completed sft-nova-integ-* job at runtime (produced every few hours by the scheduled Nova SFT workflow) and verify its output model package still exists before using it; skip if none is found. This avoids depending on a hardcoded job name that goes stale once resource cleanup deletes its model package. X-AI-Prompt: Replace the hardcoded Nova training job fixture with runtime discovery of the latest completed sft-nova-integ job whose output model package still exists X-AI-Tool: kiro-cli

BedrockModelBuilder._get_checkpoint_uri_from_manifest located manifest.json via self.model.model_artifacts.s3_model_artifacts. Nova fine-tuning jobs produced by SFTTrainer/RLVRTrainer/DPOTrainer run serverless and do not populate model_artifacts (it is Unassigned; there is no model.tar.gz), so deploying a Nova TrainingJob to Bedrock failed with "AttributeError: 'Unassigned' object has no attribute 's3_model_artifacts'". Build the manifest path from output_data_config.s3_output_path and the training job name instead. This aligns with the two other implementations that locate the Nova manifest the same way: - ModelBuilder._resolve_nova_escrow_uri (SageMaker deployment path), and - the official Nova Studio notebook (v3-examples/.../sm-studio-nova-training-job-sample-notebook.ipynb, which derives the manifest from OutputDataConfig.S3OutputPath, not model_artifacts). Verified the derived key is identical to the previous logic when model_artifacts is present, and matches the real manifest location ({s3_output}/{job_name}/output/output/manifest.json) confirmed in the test account. Also update the TestGetCheckpointUri unit tests to mock output_data_config, and keep the Nova Bedrock integ tests driving BedrockModelBuilder from the TrainingJob. X-AI-Prompt: Fix BedrockModelBuilder Nova manifest resolution to use output_data_config (matching ModelBuilder._resolve_nova_escrow_uri and the official Nova Studio notebook) and update unit tests X-AI-Tool: kiro-cli

…y on capacity shortage - _resolve_nova_escrow_uri only accepted TrainingJob/ModelTrainer, so building a Nova model from an SFTTrainer/RLVRTrainer/DPOTrainer (BaseTrainer subclasses) failed with "Nova escrow URI resolution requires a TrainingJob or ModelTrainer". Resolve the underlying job via _latest_training_job for BaseTrainer, matching _is_model_customization and _fetch_model_package_arn. - Nova deploy integ tests could fail with InsufficientInstanceCapacity, a transient region-wide ml.g6.48xlarge availability issue. Add a _deploy_or_skip_on_capacity helper that skips (instead of failing) in that case, used by the training-job and model-package deploy tests. X-AI-Prompt: Support BaseTrainer in _resolve_nova_escrow_uri and skip Nova deploy tests on transient InsufficientInstanceCapacity X-AI-Tool: kiro-cli

lucasjia-aws · 2026-06-08T23:08:10Z

mlops integ tests passed:
https://github.com/aws/sagemaker-python-sdk/actions/runs/27173864423/job/80218605054
+
https://us-west-2.console.aws.amazon.com/codesuite/codebuild/729646638167/projects/sagemaker-python-sdk-ci-sagemaker-mlops-integ-tests/build/sagemaker-python-sdk-ci-sagemaker-mlops-integ-tests%3Ab4d8d442-1f69-472b-bee8-04e53d0340d1?region=us-west-2 (rerun failed one)
+
https://us-west-2.console.aws.amazon.com/codesuite/codebuild/729646638167/projects/sagemaker-python-sdk-ci-sagemaker-mlops-integ-tests/build/sagemaker-python-sdk-ci-sagemaker-mlops-integ-tests%3Aa9b7ddc2-db7d-45e2-ae0e-4cab080e42a9?region=us-west-2 (part two)

…sync FG deletion test_enable_lake_formation_fails_with_nonexistent_role asserted the registration error contains EntityNotFoundException, but under a least-privilege iam:PassRole policy the failure surfaces as an AccessDeniedException on iam:PassRole before Lake Formation is reached. Accept EntityNotFoundException, AccessDeniedException, or iam:PassRole as valid "role not usable" outcomes for this negative test. test_delete_feature_group used a fixed 2s sleep then a single get(), but FeatureGroup deletion is asynchronous and the group stays describable while in Deleting status, causing intermittent "DID NOT RAISE". Poll get() until it raises (group fully gone) or a 120s timeout. X-AI-Prompt: Fix LF nonexistent-role negative test assertion and poll for async feature group deletion X-AI-Tool: kiro-cli

test_nova_bedrock_invoke sent content items as {"type": "text", "text": ...}, which Bedrock rejected with "Malformed input request: #/messages/0/content/0: extraneous key [type] is not permitted". Use the Nova messages-v1 InvokeModel schema instead (content items are {"text": ...} with no type key, plus schemaVersion and inferenceConfig), matching the official Nova Studio notebook, and assert on the Nova response shape output.message.content[0].text. X-AI-Prompt: Fix the Nova Bedrock invoke payload to the messages-v1 schema (no type key) per the official Nova notebook and assert the Nova response structure X-AI-Tool: kiro-cli

lucasjia-aws · 2026-06-09T01:02:38Z

integ test serve passes:
https://github.com/aws/sagemaker-python-sdk/actions/runs/27173864423/job/80218605045
rerun failed one:
https://us-west-2.console.aws.amazon.com/codesuite/codebuild/729646638167/projects/sagemaker-python-sdk-ci-sagemaker-serve-integ-tests/build/sagemaker-python-sdk-ci-sagemaker-serve-integ-tests%3Ab82feae2-1147-48a3-bb36-4fc55b4c99b1?region=us-west-2 (passed)

…kage The training_job_name fixture required the job's output model package to still exist, but the resource cleaner keeps only the oldest and newest package in the group, so every job's package was deleted and all dependent tests skipped. Build/deploy resolve artifacts from the job manifest (not the model package), so just pick the latest completed sft-nova-integ job. X-AI-Prompt: Stop requiring the Nova SFT job's output model package to exist in the fixture so tests stop skipping X-AI-Tool: kiro-cli

ModelBuilder.build fetches the training job's output model package, so the package must exist. Resource cleanup keeps only the oldest and newest package in the group, so picking the latest job left it pointing at a deleted package and every build/deploy test failed. Instead, start from a model package that currently exists and resolve the training job that produced it (parsed from the package's escrow S3 URI), preferring an SFT job. The cleaner always retains the oldest package, so this reliably yields a job whose output package is present. X-AI-Prompt: Resolve the Nova training job by reverse-lookup from an existing model package's escrow S3 URI so build/deploy tests stop failing on deleted packages X-AI-Tool: kiro-cli

* test(serve): harden model customization deployment integ tests Add post-deploy invoke verification and make the Bedrock import-job lifecycle robust in test_model_customization_deployment.py. - Verify deployed endpoints by invoking them and validating the response structure (LORA uses the adapter IC name, otherwise the default base IC). - Replace unconditional stop-all cleanup with age-based (>24h) and status-aware cleanup: stop only InProgress/Pending jobs and delete completed imported models, with logging on failures. - Add a class-scoped autouse cleanup_import_jobs fixture to replace the zzz-prefixed ordering hack. - Bound the import-job wait loop with a 60-minute timeout and fail fast on Failed status; fix importedModelName -> importedModelArn. - Delete the imported model after tests via a yielding deployed_model_arn fixture. - Configure bedrock-runtime with standard retries (10 attempts) and add a slow-marked, retrying test_bedrock_model_invoke to tolerate "model not ready" exceptions. X-AI-Prompt: Write commit message for the us-west-2 model customization deployment test hardening changes X-AI-Tool: kiro-cli * test(serve): add Nova model customization deployment integ tests (SageMaker) Add a Nova counterpart to test_model_customization_deployment.py covering ModelBuilder deployment of fine-tuned Nova models to SageMaker endpoints, running against the Nova test account in us-east-1 (784379639078). - TestModelCustomizationFromTrainingJob: build, deploy + invoke (Nova messages format), and fetch_endpoint_names_for_base_model. - TestModelCustomizationFromModelPackage: build and deploy from a registered model package. - TestInstanceTypeAutoDetection: instance type auto-detection from recipe. - TestModelCustomizationDetection: customization detection and model package ARN fetch. - TestTrainerIntegration: SFT and RLVR trainer build (DPO replaced with RLVR since Nova has no DPO recipe in SageMakerPublicHub). - Model package is resolved dynamically from the sdk-test-finetuned-models group (latest Completed), mirroring test_benchmark_evaluation_nova_model; dependent tests skip when none exists. - All tests marked us_east_1 so they run in the PR check integ-tests-us-east-1 job (intentionally not gpu_intensive, so they do not run in the scheduled GPU workflow). - Register gpu_intensive and us_east_1 markers in sagemaker-serve/tox.ini. The Bedrock deployment suite is kept commented out for now; the Nova for Bedrock integ tests will be added in a follow-up. X-AI-Prompt: Write commit message for the Nova-for-SageMaker model customization deployment integ tests and marker registration X-AI-Tool: kiro-cli * test(serve): add Nova for Bedrock model customization deployment integ tests Add TestNovaBedrockDeployment covering deployment of a fine-tuned Nova model to Amazon Bedrock via BedrockModelBuilder, complementing the existing Nova-for-SageMaker tests in the same file. - Deploy a Nova model package through BedrockModelBuilder.deploy(), which routes Nova models to create_custom_model + create_custom_model_deployment and polls each resource to Active (vs the create_model_import_job path used for open-weight models). - test_nova_bedrock_deployment_active asserts the deployment reaches Active. - test_nova_bedrock_invoke (slow) invokes the deployed model end-to-end via bedrock-runtime, with standard retries to tolerate the cold-start window. - Model package is resolved dynamically from sdk-test-finetuned-models (latest Completed); deployment fixture cleans up the deployment and custom model afterwards. Role is resolved via get_execution_role(). - Marked us_east_1 (Nova test account, us-east-1) to run in the PR check integ-tests-us-east-1 job; not gpu_intensive. - Replace the previously commented-out OSS-style Bedrock suite (it used the import-job API, which does not apply to Nova) and update the module docstring to describe both SageMaker and Bedrock deployment targets. X-AI-Prompt: Write commit message for the Nova-for-Bedrock model customization deployment integ tests X-AI-Tool: kiro-cli * test: fix Nova deployment and Lake Formation integ tests - Nova deploy/Bedrock tests: build from the TrainingJob instead of a ModelPackage, since Nova escrow artifacts are only resolvable from the training job's manifest (deploying from a ModelPackage is unsupported). - Lake Formation tests: register the S3 location with an explicit role (use_service_linked_role=False) to avoid the WithFederation+SLR combination that Lake Formation rejects. * test(serve): discover Nova SFT training job dynamically The training_job_name fixture hardcoded a reusable job whose output model package (sdk-test-nova-finetuned-models/1) was deleted, so every test that resolves the job's output model package failed with "ModelPackage ... does not exist". Discover the latest completed sft-nova-integ-* job at runtime (produced every few hours by the scheduled Nova SFT workflow) and verify its output model package still exists before using it; skip if none is found. This avoids depending on a hardcoded job name that goes stale once resource cleanup deletes its model package. X-AI-Prompt: Replace the hardcoded Nova training job fixture with runtime discovery of the latest completed sft-nova-integ job whose output model package still exists X-AI-Tool: kiro-cli * fix(serve): resolve Nova Bedrock manifest from output_data_config BedrockModelBuilder._get_checkpoint_uri_from_manifest located manifest.json via self.model.model_artifacts.s3_model_artifacts. Nova fine-tuning jobs produced by SFTTrainer/RLVRTrainer/DPOTrainer run serverless and do not populate model_artifacts (it is Unassigned; there is no model.tar.gz), so deploying a Nova TrainingJob to Bedrock failed with "AttributeError: 'Unassigned' object has no attribute 's3_model_artifacts'". Build the manifest path from output_data_config.s3_output_path and the training job name instead. This aligns with the two other implementations that locate the Nova manifest the same way: - ModelBuilder._resolve_nova_escrow_uri (SageMaker deployment path), and - the official Nova Studio notebook (v3-examples/.../sm-studio-nova-training-job-sample-notebook.ipynb, which derives the manifest from OutputDataConfig.S3OutputPath, not model_artifacts). Verified the derived key is identical to the previous logic when model_artifacts is present, and matches the real manifest location ({s3_output}/{job_name}/output/output/manifest.json) confirmed in the test account. Also update the TestGetCheckpointUri unit tests to mock output_data_config, and keep the Nova Bedrock integ tests driving BedrockModelBuilder from the TrainingJob. X-AI-Prompt: Fix BedrockModelBuilder Nova manifest resolution to use output_data_config (matching ModelBuilder._resolve_nova_escrow_uri and the official Nova Studio notebook) and update unit tests X-AI-Tool: kiro-cli * fix(serve): support BaseTrainer in Nova escrow resolution; skip deploy on capacity shortage - _resolve_nova_escrow_uri only accepted TrainingJob/ModelTrainer, so building a Nova model from an SFTTrainer/RLVRTrainer/DPOTrainer (BaseTrainer subclasses) failed with "Nova escrow URI resolution requires a TrainingJob or ModelTrainer". Resolve the underlying job via _latest_training_job for BaseTrainer, matching _is_model_customization and _fetch_model_package_arn. - Nova deploy integ tests could fail with InsufficientInstanceCapacity, a transient region-wide ml.g6.48xlarge availability issue. Add a _deploy_or_skip_on_capacity helper that skips (instead of failing) in that case, used by the training-job and model-package deploy tests. X-AI-Prompt: Support BaseTrainer in _resolve_nova_escrow_uri and skip Nova deploy tests on transient InsufficientInstanceCapacity X-AI-Tool: kiro-cli * Fix flaky feature store integ tests: LF negative-role assertion and async FG deletion test_enable_lake_formation_fails_with_nonexistent_role asserted the registration error contains EntityNotFoundException, but under a least-privilege iam:PassRole policy the failure surfaces as an AccessDeniedException on iam:PassRole before Lake Formation is reached. Accept EntityNotFoundException, AccessDeniedException, or iam:PassRole as valid "role not usable" outcomes for this negative test. test_delete_feature_group used a fixed 2s sleep then a single get(), but FeatureGroup deletion is asynchronous and the group stays describable while in Deleting status, causing intermittent "DID NOT RAISE". Poll get() until it raises (group fully gone) or a 120s timeout. X-AI-Prompt: Fix LF nonexistent-role negative test assertion and poll for async feature group deletion X-AI-Tool: kiro-cli * test(serve): use Nova messages-v1 schema for Bedrock invoke test_nova_bedrock_invoke sent content items as {"type": "text", "text": ...}, which Bedrock rejected with "Malformed input request: #/messages/0/content/0: extraneous key [type] is not permitted". Use the Nova messages-v1 InvokeModel schema instead (content items are {"text": ...} with no type key, plus schemaVersion and inferenceConfig), matching the official Nova Studio notebook, and assert on the Nova response shape output.message.content[0].text. X-AI-Prompt: Fix the Nova Bedrock invoke payload to the messages-v1 schema (no type key) per the official Nova notebook and assert the Nova response structure X-AI-Tool: kiro-cli * chore(serve): trim verbose comments * test(serve): pick latest Nova SFT job without requiring its model package The training_job_name fixture required the job's output model package to still exist, but the resource cleaner keeps only the oldest and newest package in the group, so every job's package was deleted and all dependent tests skipped. Build/deploy resolve artifacts from the job manifest (not the model package), so just pick the latest completed sft-nova-integ job. X-AI-Prompt: Stop requiring the Nova SFT job's output model package to exist in the fixture so tests stop skipping X-AI-Tool: kiro-cli * test(serve): resolve Nova training job from an existing model package ModelBuilder.build fetches the training job's output model package, so the package must exist. Resource cleanup keeps only the oldest and newest package in the group, so picking the latest job left it pointing at a deleted package and every build/deploy test failed. Instead, start from a model package that currently exists and resolve the training job that produced it (parsed from the package's escrow S3 URI), preferring an SFT job. The cleaner always retains the oldest package, so this reliably yields a job whose output package is present. X-AI-Prompt: Resolve the Nova training job by reverse-lookup from an existing model package's escrow S3 URI so build/deploy tests stop failing on deleted packages X-AI-Tool: kiro-cli

lucasjia-aws added 3 commits June 8, 2026 00:29

lucasjia-aws temporarily deployed to auto-approve June 8, 2026 08:18 — with GitHub Actions Inactive

lucasjia-aws temporarily deployed to auto-approve June 8, 2026 20:14 — with GitHub Actions Inactive

lucasjia-aws temporarily deployed to auto-approve June 8, 2026 20:45 — with GitHub Actions Inactive

lucasjia-aws temporarily deployed to auto-approve June 8, 2026 20:46 — with GitHub Actions Inactive

lucasjia-aws temporarily deployed to auto-approve June 8, 2026 21:44 — with GitHub Actions Inactive

lucasjia-aws temporarily deployed to auto-approve June 8, 2026 22:36 — with GitHub Actions Inactive

lucasjia-aws temporarily deployed to auto-approve June 8, 2026 23:36 — with GitHub Actions Inactive

aviruthen previously approved these changes Jun 8, 2026

View reviewed changes

lucasjia-aws added 2 commits June 8, 2026 17:26

chore(serve): trim verbose comments

dd2bef8

lucasjia-aws dismissed aviruthen’s stale review via dd2bef8 June 9, 2026 00:43

lucasjia-aws temporarily deployed to auto-approve June 9, 2026 00:44 — with GitHub Actions Inactive

aviruthen previously approved these changes Jun 9, 2026

View reviewed changes

lucasjia-aws dismissed aviruthen’s stale review via f023b87 June 9, 2026 01:12

lucasjia-aws temporarily deployed to auto-approve June 9, 2026 01:12 — with GitHub Actions Inactive

lucasjia-aws temporarily deployed to auto-approve June 9, 2026 01:25 — with GitHub Actions Inactive

aviruthen approved these changes Jun 9, 2026

View reviewed changes

lucasjia-aws merged commit 63ac789 into aws:master Jun 9, 2026
32 of 48 checks passed

lucasjia-aws deleted the test/nova_tests branch June 11, 2026 22:20

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

test: add nova tests#5933

test: add nova tests#5933
lucasjia-aws merged 12 commits into
aws:masterfrom
lucasjia-aws:test/nova_tests

lucasjia-aws commented Jun 8, 2026 •

edited

Loading

Uh oh!

lucasjia-aws commented Jun 8, 2026 •

edited

Loading

Uh oh!

lucasjia-aws commented Jun 9, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

lucasjia-aws commented Jun 8, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Overview

Part 1 — New Nova tests

New tests

Required source changes

Part 2 — Fixes to other (pre-existing) test failures

Uh oh!

lucasjia-aws commented Jun 8, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

lucasjia-aws commented Jun 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

lucasjia-aws commented Jun 8, 2026 •

edited

Loading

lucasjia-aws commented Jun 8, 2026 •

edited

Loading

lucasjia-aws commented Jun 9, 2026 •

edited

Loading