Skip to content

test: add nova tests#5933

Merged
lucasjia-aws merged 12 commits into
aws:masterfrom
lucasjia-aws:test/nova_tests
Jun 9, 2026
Merged

test: add nova tests#5933
lucasjia-aws merged 12 commits into
aws:masterfrom
lucasjia-aws:test/nova_tests

Conversation

@lucasjia-aws

@lucasjia-aws lucasjia-aws commented Jun 8, 2026

Copy link
Copy Markdown
Collaborator

Overview

This PR adds Nova model-customization deployment integration tests (SageMaker
endpoints and Amazon Bedrock custom models) and fixes a number of pre-existing,
unrelated integ-test failures surfaced once these tests started running in the
integ-tests-us-east-1 PR check.

Two source files in sagemaker-serve are modified. Both are genuine product
bugs in Nova code paths, justified in detail in Part 1 below — they are not
worked around in the tests because doing so would hide functionality the SDK
publicly claims to support.


Part 1 — New Nova tests

New integ test file: sagemaker-serve/tests/integ/test_nova_model_customization_deployment.py
(the Nova counterpart of test_model_customization_deployment.py). Tests are
marked us_east_1 so they run only in the us-east-1 (Nova) account via the
integ-tests-us-east-1 PR check.

New tests

Test class / test What it covers
TestModelCustomizationFromTrainingJob (test_build_from_training_job, test_deploy_from_training_job, test_fetch_endpoint_names_for_base_model) Build a Nova model from a TrainingJob, deploy it to a SageMaker endpoint, invoke it (Nova messages format), and fetch base-model endpoint names
TestModelCustomizationFromModelPackage (test_build_from_model_package, test_deploy_from_model_package) Build/deploy via the registered model package and validate the endpoint
TestInstanceTypeAutoDetection (test_instance_type_from_recipe) Nova requires an explicit supported instance type (ml.g6.48xlarge)
TestModelCustomizationDetection (test_is_model_customization_training_job, test_is_model_customization_model_package, test_fetch_model_package_arn) Model-customization detection and model-package ARN resolution for Nova
TestTrainerIntegration (test_sft_trainer_build, test_rlvr_trainer_build) ModelBuilder accepts an SFT/RLVR trainer object and builds the Nova model (DPO replaced with RLVR — Nova has no DPO recipe in SageMakerPublicHub)
TestNovaBedrockDeployment (test_nova_bedrock_deployment_active, test_nova_bedrock_invoke) Deploy a fine-tuned Nova model to Amazon Bedrock as a custom model (create_custom_model + create_custom_model_deployment, polling to Active) and invoke it

The training_job_name fixture discovers the latest completed sft-nova-integ-*
job (produced every few hours by the scheduled Nova SFT workflow) whose output
model package still exists, rather than hardcoding a job name that goes stale
when resource cleanup deletes its model package.

Required source changes

File Change Why it is necessary (not a test/config issue)
sagemaker-serve/src/sagemaker/serve/bedrock_model_builder.py _get_checkpoint_uri_from_manifest locates manifest.json from output_data_config.s3_output_path + training job name instead of model_artifacts.s3_model_artifacts Nova fine-tuning jobs (SFTTrainer/RLVRTrainer/DPOTrainer) run serverless and never populate model_artifacts (no model.tar.gz; the field is Unassigned), so the old code raised AttributeError: 'Unassigned' object has no attribute 's3_model_artifacts' for any Nova job. This is corroborated by three independent sources that all locate the Nova manifest via output_data_config: (a) the sibling method ModelBuilder._resolve_nova_escrow_uri, (b) the official Nova Studio notebook sm-studio-nova-training-job-sample-notebook.ipynb, and (c) the real manifest path verified in the test account. BedrockModelBuilder was the only place using model_artifacts for Nova — an isolated inconsistency. It cannot be worked around in tests because no Nova job has model_artifacts.
sagemaker-serve/src/sagemaker/serve/model_builder.py _resolve_nova_escrow_uri resolves the underlying TrainingJob via _latest_training_job for BaseTrainer instances ModelBuilder publicly supports a trainer object as model (type is Union[..., ModelTrainer, BaseTrainer, TrainingJob, ModelPackage, ...]), and SFTTrainer/RLVRTrainer/DPOTrainer are BaseTrainer subclasses. The sibling methods _is_model_customization and _fetch_model_package_arn already handle BaseTrainer; only _resolve_nova_escrow_uri omitted it, so the same ModelBuilder(model=trainer) worked for detection/ARN-fetch but failed with "Nova escrow URI resolution requires a TrainingJob or ModelTrainer" on escrow resolution. This is an internal inconsistency; the fix aligns it with the other methods.

Part 2 — Fixes to other (pre-existing) test failures

These failures were already present in the suite and were surfaced/fixed while
bringing up the new tests; some were addressed across earlier iterations.

Area / file Failing test(s) Root cause Fix
sagemaker-serve/tests/integ/test_model_customization_deployment.py (OSS) test_deploy_from_training_job and the Bedrock import suite Deployed endpoints/imported models were not verified or reliably cleaned up; import-job wait was unbounded Added post-deploy invoke verification, bounded import-job wait with timeout/failure handling, age- and status-aware Bedrock cleanup, retrying invoke, and a yielding deployed_model_arn fixture that deletes the imported model
test_nova_model_customization_deployment.py (model-package & Bedrock paths) test_build/deploy_from_model_package, TestNovaBedrockDeployment Deploying Nova from a ModelPackage is unsupported (escrow artifacts are only resolvable from the TrainingJob manifest; the package is non-RMP) Drove these via the supported TrainingJob path instead of the ModelPackage
test_nova_model_customization_deployment.py (instance type) test_instance_type_from_recipe, test_sft_trainer_build, test_rlvr_trainer_build ModelBuilder defaulted to ml.m5.large, which Nova rejects Pass the supported ml.g6.48xlarge; assert it is used (Nova has no instance-type auto-detection)
test_nova_model_customization_deployment.py (capacity) test_deploy_from_training_job, test_deploy_from_model_package Transient region-wide InsufficientInstanceCapacity for ml.g6.48xlarge (not a quota or code issue) Added _deploy_or_skip_on_capacity helper that skips (rather than fails) on capacity shortage
sagemaker-mlops/tests/integ/test_feature_store_lakeformation.py test_create_feature_group_and_enable_lake_formation, test_create_feature_group_with_lake_formation_enabled, test_enable_lake_formation_full_flow_with_policy_output, test_enable_lake_formation_default_logs_recommended_policy enable_lake_formation defaulted to use_service_linked_role=True, producing RegisterResource with both WithFederation=True and UseServiceLinkedRole=True — a combination Lake Formation rejects (InvalidInputException: Unable to register the following path); all existing registrations in the account used an explicit role Register with use_service_linked_role=False, registration_role_arn=role, matching the supported explicit-role path
sagemaker-mlops/tests/integ/test_feature_store_lakeformation.py test_enable_lake_formation_fails_with_nonexistent_role Negative test asserted the error contains EntityNotFoundException, but under a least-privilege iam:PassRole policy the failure surfaces as AccessDeniedException on iam:PassRole before Lake Formation is reached Accept EntityNotFoundException, AccessDeniedException, or iam:PassRole as valid "role not usable" outcomes
sagemaker-mlops/tests/integ/test_feature_store.py test_delete_feature_group Fixed 2s sleep then a single get(); feature-group deletion is asynchronous and stays describable while Deleting, causing intermittent "DID NOT RAISE" Poll get() until it raises (group gone) or a 120s timeout

Note: the IAM permission gaps these tests also exposed (build-role iam:PassRole
to Bedrock/Lake Formation, bedrock:CreateCustomModel/custom-model-deployment,
and feature-store S3 bucket-policy/encryption permissions) are CI-infrastructure
changes handled separately in the SageMakerMLFPySDKInfraCDK package, not in
this repo.

Add post-deploy invoke verification and make the Bedrock import-job
lifecycle robust in test_model_customization_deployment.py.

- Verify deployed endpoints by invoking them and validating the
  response structure (LORA uses the adapter IC name, otherwise the
  default base IC).
- Replace unconditional stop-all cleanup with age-based (>24h) and
  status-aware cleanup: stop only InProgress/Pending jobs and delete
  completed imported models, with logging on failures.
- Add a class-scoped autouse cleanup_import_jobs fixture to replace the
  zzz-prefixed ordering hack.
- Bound the import-job wait loop with a 60-minute timeout and fail fast
  on Failed status; fix importedModelName -> importedModelArn.
- Delete the imported model after tests via a yielding deployed_model_arn
  fixture.
- Configure bedrock-runtime with standard retries (10 attempts) and add a
  slow-marked, retrying test_bedrock_model_invoke to tolerate
  "model not ready" exceptions.

X-AI-Prompt: Write commit message for the us-west-2 model customization deployment test hardening changes
X-AI-Tool: kiro-cli
…eMaker)

Add a Nova counterpart to test_model_customization_deployment.py covering
ModelBuilder deployment of fine-tuned Nova models to SageMaker endpoints,
running against the Nova test account in us-east-1 (784379639078).

- TestModelCustomizationFromTrainingJob: build, deploy + invoke (Nova
  messages format), and fetch_endpoint_names_for_base_model.
- TestModelCustomizationFromModelPackage: build and deploy from a
  registered model package.
- TestInstanceTypeAutoDetection: instance type auto-detection from recipe.
- TestModelCustomizationDetection: customization detection and model
  package ARN fetch.
- TestTrainerIntegration: SFT and RLVR trainer build (DPO replaced with
  RLVR since Nova has no DPO recipe in SageMakerPublicHub).
- Model package is resolved dynamically from the sdk-test-finetuned-models
  group (latest Completed), mirroring test_benchmark_evaluation_nova_model;
  dependent tests skip when none exists.
- All tests marked us_east_1 so they run in the PR check
  integ-tests-us-east-1 job (intentionally not gpu_intensive, so they do
  not run in the scheduled GPU workflow).
- Register gpu_intensive and us_east_1 markers in sagemaker-serve/tox.ini.

The Bedrock deployment suite is kept commented out for now; the Nova for
Bedrock integ tests will be added in a follow-up.

X-AI-Prompt: Write commit message for the Nova-for-SageMaker model customization deployment integ tests and marker registration
X-AI-Tool: kiro-cli
…g tests

Add TestNovaBedrockDeployment covering deployment of a fine-tuned Nova
model to Amazon Bedrock via BedrockModelBuilder, complementing the existing
Nova-for-SageMaker tests in the same file.

- Deploy a Nova model package through BedrockModelBuilder.deploy(), which
  routes Nova models to create_custom_model + create_custom_model_deployment
  and polls each resource to Active (vs the create_model_import_job path used
  for open-weight models).
- test_nova_bedrock_deployment_active asserts the deployment reaches Active.
- test_nova_bedrock_invoke (slow) invokes the deployed model end-to-end via
  bedrock-runtime, with standard retries to tolerate the cold-start window.
- Model package is resolved dynamically from sdk-test-finetuned-models
  (latest Completed); deployment fixture cleans up the deployment and custom
  model afterwards. Role is resolved via get_execution_role().
- Marked us_east_1 (Nova test account, us-east-1) to run in the PR check
  integ-tests-us-east-1 job; not gpu_intensive.
- Replace the previously commented-out OSS-style Bedrock suite (it used the
  import-job API, which does not apply to Nova) and update the module
  docstring to describe both SageMaker and Bedrock deployment targets.

X-AI-Prompt: Write commit message for the Nova-for-Bedrock model customization deployment integ tests
X-AI-Tool: kiro-cli
- Nova deploy/Bedrock tests: build from the TrainingJob instead of a
  ModelPackage, since Nova escrow artifacts are only resolvable from the
  training job's manifest (deploying from a ModelPackage is unsupported).
- Lake Formation tests: register the S3 location with an explicit role
  (use_service_linked_role=False) to avoid the WithFederation+SLR
  combination that Lake Formation rejects.
The training_job_name fixture hardcoded a reusable job whose output model
package (sdk-test-nova-finetuned-models/1) was deleted, so every test that
resolves the job's output model package failed with "ModelPackage ... does not
exist".

Discover the latest completed sft-nova-integ-* job at runtime (produced every
few hours by the scheduled Nova SFT workflow) and verify its output model
package still exists before using it; skip if none is found. This avoids
depending on a hardcoded job name that goes stale once resource cleanup deletes
its model package.

X-AI-Prompt: Replace the hardcoded Nova training job fixture with runtime discovery of the latest completed sft-nova-integ job whose output model package still exists
X-AI-Tool: kiro-cli
BedrockModelBuilder._get_checkpoint_uri_from_manifest located manifest.json via
self.model.model_artifacts.s3_model_artifacts. Nova fine-tuning jobs produced by
SFTTrainer/RLVRTrainer/DPOTrainer run serverless and do not populate
model_artifacts (it is Unassigned; there is no model.tar.gz), so deploying a Nova
TrainingJob to Bedrock failed with
"AttributeError: 'Unassigned' object has no attribute 's3_model_artifacts'".

Build the manifest path from output_data_config.s3_output_path and the training
job name instead. This aligns with the two other implementations that locate the
Nova manifest the same way:
- ModelBuilder._resolve_nova_escrow_uri (SageMaker deployment path), and
- the official Nova Studio notebook
  (v3-examples/.../sm-studio-nova-training-job-sample-notebook.ipynb, which
  derives the manifest from OutputDataConfig.S3OutputPath, not model_artifacts).

Verified the derived key is identical to the previous logic when model_artifacts
is present, and matches the real manifest location
({s3_output}/{job_name}/output/output/manifest.json) confirmed in the test
account.

Also update the TestGetCheckpointUri unit tests to mock output_data_config, and
keep the Nova Bedrock integ tests driving BedrockModelBuilder from the
TrainingJob.

X-AI-Prompt: Fix BedrockModelBuilder Nova manifest resolution to use output_data_config (matching ModelBuilder._resolve_nova_escrow_uri and the official Nova Studio notebook) and update unit tests
X-AI-Tool: kiro-cli
…y on capacity shortage

- _resolve_nova_escrow_uri only accepted TrainingJob/ModelTrainer, so building a
  Nova model from an SFTTrainer/RLVRTrainer/DPOTrainer (BaseTrainer subclasses)
  failed with "Nova escrow URI resolution requires a TrainingJob or
  ModelTrainer". Resolve the underlying job via _latest_training_job for
  BaseTrainer, matching _is_model_customization and _fetch_model_package_arn.
- Nova deploy integ tests could fail with InsufficientInstanceCapacity, a
  transient region-wide ml.g6.48xlarge availability issue. Add a
  _deploy_or_skip_on_capacity helper that skips (instead of failing) in that
  case, used by the training-job and model-package deploy tests.

X-AI-Prompt: Support BaseTrainer in _resolve_nova_escrow_uri and skip Nova deploy tests on transient InsufficientInstanceCapacity
X-AI-Tool: kiro-cli
…sync FG deletion

test_enable_lake_formation_fails_with_nonexistent_role asserted the registration
error contains EntityNotFoundException, but under a least-privilege iam:PassRole
policy the failure surfaces as an AccessDeniedException on iam:PassRole before
Lake Formation is reached. Accept EntityNotFoundException, AccessDeniedException,
or iam:PassRole as valid "role not usable" outcomes for this negative test.

test_delete_feature_group used a fixed 2s sleep then a single get(), but
FeatureGroup deletion is asynchronous and the group stays describable while in
Deleting status, causing intermittent "DID NOT RAISE". Poll get() until it
raises (group fully gone) or a 120s timeout.

X-AI-Prompt: Fix LF nonexistent-role negative test assertion and poll for async feature group deletion
X-AI-Tool: kiro-cli
aviruthen
aviruthen previously approved these changes Jun 8, 2026
test_nova_bedrock_invoke sent content items as {"type": "text", "text": ...},
which Bedrock rejected with "Malformed input request: #/messages/0/content/0:
extraneous key [type] is not permitted".

Use the Nova messages-v1 InvokeModel schema instead (content items are
{"text": ...} with no type key, plus schemaVersion and inferenceConfig),
matching the official Nova Studio notebook, and assert on the Nova response
shape output.message.content[0].text.

X-AI-Prompt: Fix the Nova Bedrock invoke payload to the messages-v1 schema (no type key) per the official Nova notebook and assert the Nova response structure
X-AI-Tool: kiro-cli
aviruthen
aviruthen previously approved these changes Jun 9, 2026
…kage

The training_job_name fixture required the job's output model package to still
exist, but the resource cleaner keeps only the oldest and newest package in the
group, so every job's package was deleted and all dependent tests skipped.
Build/deploy resolve artifacts from the job manifest (not the model package),
so just pick the latest completed sft-nova-integ job.

X-AI-Prompt: Stop requiring the Nova SFT job's output model package to exist in the fixture so tests stop skipping
X-AI-Tool: kiro-cli
ModelBuilder.build fetches the training job's output model package, so the
package must exist. Resource cleanup keeps only the oldest and newest package
in the group, so picking the latest job left it pointing at a deleted package
and every build/deploy test failed.

Instead, start from a model package that currently exists and resolve the
training job that produced it (parsed from the package's escrow S3 URI),
preferring an SFT job. The cleaner always retains the oldest package, so this
reliably yields a job whose output package is present.

X-AI-Prompt: Resolve the Nova training job by reverse-lookup from an existing model package's escrow S3 URI so build/deploy tests stop failing on deleted packages
X-AI-Tool: kiro-cli
@lucasjia-aws lucasjia-aws merged commit 63ac789 into aws:master Jun 9, 2026
32 of 48 checks passed
@lucasjia-aws lucasjia-aws deleted the test/nova_tests branch June 11, 2026 22:20
guanweim pushed a commit to guanweim/sagemaker-python-sdk that referenced this pull request Jun 15, 2026
* test(serve): harden model customization deployment integ tests

Add post-deploy invoke verification and make the Bedrock import-job
lifecycle robust in test_model_customization_deployment.py.

- Verify deployed endpoints by invoking them and validating the
  response structure (LORA uses the adapter IC name, otherwise the
  default base IC).
- Replace unconditional stop-all cleanup with age-based (>24h) and
  status-aware cleanup: stop only InProgress/Pending jobs and delete
  completed imported models, with logging on failures.
- Add a class-scoped autouse cleanup_import_jobs fixture to replace the
  zzz-prefixed ordering hack.
- Bound the import-job wait loop with a 60-minute timeout and fail fast
  on Failed status; fix importedModelName -> importedModelArn.
- Delete the imported model after tests via a yielding deployed_model_arn
  fixture.
- Configure bedrock-runtime with standard retries (10 attempts) and add a
  slow-marked, retrying test_bedrock_model_invoke to tolerate
  "model not ready" exceptions.

X-AI-Prompt: Write commit message for the us-west-2 model customization deployment test hardening changes
X-AI-Tool: kiro-cli

* test(serve): add Nova model customization deployment integ tests (SageMaker)

Add a Nova counterpart to test_model_customization_deployment.py covering
ModelBuilder deployment of fine-tuned Nova models to SageMaker endpoints,
running against the Nova test account in us-east-1 (784379639078).

- TestModelCustomizationFromTrainingJob: build, deploy + invoke (Nova
  messages format), and fetch_endpoint_names_for_base_model.
- TestModelCustomizationFromModelPackage: build and deploy from a
  registered model package.
- TestInstanceTypeAutoDetection: instance type auto-detection from recipe.
- TestModelCustomizationDetection: customization detection and model
  package ARN fetch.
- TestTrainerIntegration: SFT and RLVR trainer build (DPO replaced with
  RLVR since Nova has no DPO recipe in SageMakerPublicHub).
- Model package is resolved dynamically from the sdk-test-finetuned-models
  group (latest Completed), mirroring test_benchmark_evaluation_nova_model;
  dependent tests skip when none exists.
- All tests marked us_east_1 so they run in the PR check
  integ-tests-us-east-1 job (intentionally not gpu_intensive, so they do
  not run in the scheduled GPU workflow).
- Register gpu_intensive and us_east_1 markers in sagemaker-serve/tox.ini.

The Bedrock deployment suite is kept commented out for now; the Nova for
Bedrock integ tests will be added in a follow-up.

X-AI-Prompt: Write commit message for the Nova-for-SageMaker model customization deployment integ tests and marker registration
X-AI-Tool: kiro-cli

* test(serve): add Nova for Bedrock model customization deployment integ tests

Add TestNovaBedrockDeployment covering deployment of a fine-tuned Nova
model to Amazon Bedrock via BedrockModelBuilder, complementing the existing
Nova-for-SageMaker tests in the same file.

- Deploy a Nova model package through BedrockModelBuilder.deploy(), which
  routes Nova models to create_custom_model + create_custom_model_deployment
  and polls each resource to Active (vs the create_model_import_job path used
  for open-weight models).
- test_nova_bedrock_deployment_active asserts the deployment reaches Active.
- test_nova_bedrock_invoke (slow) invokes the deployed model end-to-end via
  bedrock-runtime, with standard retries to tolerate the cold-start window.
- Model package is resolved dynamically from sdk-test-finetuned-models
  (latest Completed); deployment fixture cleans up the deployment and custom
  model afterwards. Role is resolved via get_execution_role().
- Marked us_east_1 (Nova test account, us-east-1) to run in the PR check
  integ-tests-us-east-1 job; not gpu_intensive.
- Replace the previously commented-out OSS-style Bedrock suite (it used the
  import-job API, which does not apply to Nova) and update the module
  docstring to describe both SageMaker and Bedrock deployment targets.

X-AI-Prompt: Write commit message for the Nova-for-Bedrock model customization deployment integ tests
X-AI-Tool: kiro-cli

* test: fix Nova deployment and Lake Formation integ tests

- Nova deploy/Bedrock tests: build from the TrainingJob instead of a
  ModelPackage, since Nova escrow artifacts are only resolvable from the
  training job's manifest (deploying from a ModelPackage is unsupported).
- Lake Formation tests: register the S3 location with an explicit role
  (use_service_linked_role=False) to avoid the WithFederation+SLR
  combination that Lake Formation rejects.

* test(serve): discover Nova SFT training job dynamically

The training_job_name fixture hardcoded a reusable job whose output model
package (sdk-test-nova-finetuned-models/1) was deleted, so every test that
resolves the job's output model package failed with "ModelPackage ... does not
exist".

Discover the latest completed sft-nova-integ-* job at runtime (produced every
few hours by the scheduled Nova SFT workflow) and verify its output model
package still exists before using it; skip if none is found. This avoids
depending on a hardcoded job name that goes stale once resource cleanup deletes
its model package.

X-AI-Prompt: Replace the hardcoded Nova training job fixture with runtime discovery of the latest completed sft-nova-integ job whose output model package still exists
X-AI-Tool: kiro-cli

* fix(serve): resolve Nova Bedrock manifest from output_data_config

BedrockModelBuilder._get_checkpoint_uri_from_manifest located manifest.json via
self.model.model_artifacts.s3_model_artifacts. Nova fine-tuning jobs produced by
SFTTrainer/RLVRTrainer/DPOTrainer run serverless and do not populate
model_artifacts (it is Unassigned; there is no model.tar.gz), so deploying a Nova
TrainingJob to Bedrock failed with
"AttributeError: 'Unassigned' object has no attribute 's3_model_artifacts'".

Build the manifest path from output_data_config.s3_output_path and the training
job name instead. This aligns with the two other implementations that locate the
Nova manifest the same way:
- ModelBuilder._resolve_nova_escrow_uri (SageMaker deployment path), and
- the official Nova Studio notebook
  (v3-examples/.../sm-studio-nova-training-job-sample-notebook.ipynb, which
  derives the manifest from OutputDataConfig.S3OutputPath, not model_artifacts).

Verified the derived key is identical to the previous logic when model_artifacts
is present, and matches the real manifest location
({s3_output}/{job_name}/output/output/manifest.json) confirmed in the test
account.

Also update the TestGetCheckpointUri unit tests to mock output_data_config, and
keep the Nova Bedrock integ tests driving BedrockModelBuilder from the
TrainingJob.

X-AI-Prompt: Fix BedrockModelBuilder Nova manifest resolution to use output_data_config (matching ModelBuilder._resolve_nova_escrow_uri and the official Nova Studio notebook) and update unit tests
X-AI-Tool: kiro-cli

* fix(serve): support BaseTrainer in Nova escrow resolution; skip deploy on capacity shortage

- _resolve_nova_escrow_uri only accepted TrainingJob/ModelTrainer, so building a
  Nova model from an SFTTrainer/RLVRTrainer/DPOTrainer (BaseTrainer subclasses)
  failed with "Nova escrow URI resolution requires a TrainingJob or
  ModelTrainer". Resolve the underlying job via _latest_training_job for
  BaseTrainer, matching _is_model_customization and _fetch_model_package_arn.
- Nova deploy integ tests could fail with InsufficientInstanceCapacity, a
  transient region-wide ml.g6.48xlarge availability issue. Add a
  _deploy_or_skip_on_capacity helper that skips (instead of failing) in that
  case, used by the training-job and model-package deploy tests.

X-AI-Prompt: Support BaseTrainer in _resolve_nova_escrow_uri and skip Nova deploy tests on transient InsufficientInstanceCapacity
X-AI-Tool: kiro-cli

* Fix flaky feature store integ tests: LF negative-role assertion and async FG deletion

test_enable_lake_formation_fails_with_nonexistent_role asserted the registration
error contains EntityNotFoundException, but under a least-privilege iam:PassRole
policy the failure surfaces as an AccessDeniedException on iam:PassRole before
Lake Formation is reached. Accept EntityNotFoundException, AccessDeniedException,
or iam:PassRole as valid "role not usable" outcomes for this negative test.

test_delete_feature_group used a fixed 2s sleep then a single get(), but
FeatureGroup deletion is asynchronous and the group stays describable while in
Deleting status, causing intermittent "DID NOT RAISE". Poll get() until it
raises (group fully gone) or a 120s timeout.

X-AI-Prompt: Fix LF nonexistent-role negative test assertion and poll for async feature group deletion
X-AI-Tool: kiro-cli

* test(serve): use Nova messages-v1 schema for Bedrock invoke

test_nova_bedrock_invoke sent content items as {"type": "text", "text": ...},
which Bedrock rejected with "Malformed input request: #/messages/0/content/0:
extraneous key [type] is not permitted".

Use the Nova messages-v1 InvokeModel schema instead (content items are
{"text": ...} with no type key, plus schemaVersion and inferenceConfig),
matching the official Nova Studio notebook, and assert on the Nova response
shape output.message.content[0].text.

X-AI-Prompt: Fix the Nova Bedrock invoke payload to the messages-v1 schema (no type key) per the official Nova notebook and assert the Nova response structure
X-AI-Tool: kiro-cli

* chore(serve): trim verbose comments

* test(serve): pick latest Nova SFT job without requiring its model package

The training_job_name fixture required the job's output model package to still
exist, but the resource cleaner keeps only the oldest and newest package in the
group, so every job's package was deleted and all dependent tests skipped.
Build/deploy resolve artifacts from the job manifest (not the model package),
so just pick the latest completed sft-nova-integ job.

X-AI-Prompt: Stop requiring the Nova SFT job's output model package to exist in the fixture so tests stop skipping
X-AI-Tool: kiro-cli

* test(serve): resolve Nova training job from an existing model package

ModelBuilder.build fetches the training job's output model package, so the
package must exist. Resource cleanup keeps only the oldest and newest package
in the group, so picking the latest job left it pointing at a deleted package
and every build/deploy test failed.

Instead, start from a model package that currently exists and resolve the
training job that produced it (parsed from the package's escrow S3 URI),
preferring an SFT job. The cleaner always retains the oldest package, so this
reliably yields a job whose output package is present.

X-AI-Prompt: Resolve the Nova training job by reverse-lookup from an existing model package's escrow S3 URI so build/deploy tests stop failing on deleted packages
X-AI-Tool: kiro-cli
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants