Minimal, cost-safe SageMaker MLOps reference for production-style deployments.
This repo is infrastructure-first: it documents how to deploy models safely, not how to experiment.
No notebooks. No Studio. No hand-clicking.
- Deterministic training
- Immutable model artifacts
- Terraform-controlled deployments
- Canary + rollback without retraining
- Minimal AWS blast radius
- Interview-defensible design
- Local VS Code + AWS CLI
- SageMaker built-in `sklearn` container
- S3 for model artifacts
- SageMaker:
  - Model
  - EndpointConfig
  - Endpoint
- Terraform as single control plane
- (Later) GitHub Actions for orchestration
```text
.
├── bootstrap/    # One-time infra: state bucket, DynamoDB lock table, IAM
├── data/         # Training / testing data
├── model/        # Model creation artifacts
├── terraform/    # MLOps infra: models, endpoints, configs
├── pipelines/    # Training / inference / deployment / API testing
└── README.md
```
`bootstrap/` is never run automatically. It exists to document how Terraform state is safely provisioned.
- Remote state stored in S3
- State locking via DynamoDB
- IAM scoped to:
  - S3 state bucket objects only
  - DynamoDB lock table only
- Optional GitHub Actions OIDC trust for CI
This avoids:
- Shared global state
- Cross-account contamination
- Accidental privilege escalation
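As an illustration of that scoping, here is a minimal Terraform sketch of a state-access policy. The bucket name, table name, and account ID are placeholders, not the repo's actual values:

```hcl
# Hypothetical state-scoped policy: S3 state objects + DynamoDB lock table only.
data "aws_iam_policy_document" "state_access" {
  statement {
    sid       = "StateBucketList"
    actions   = ["s3:ListBucket"]
    resources = ["arn:aws:s3:::example-tf-state-bucket"]
  }

  statement {
    sid       = "StateObjects"
    actions   = ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"]
    resources = ["arn:aws:s3:::example-tf-state-bucket/envs/*"]
  }

  statement {
    sid       = "StateLock"
    actions   = ["dynamodb:GetItem", "dynamodb:PutItem", "dynamodb:DeleteItem"]
    resources = ["arn:aws:dynamodb:us-east-1:123456789012:table/example-tf-locks"]
  }
}
```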
- Navigate to the bootstrap folder: `cd bootstrap`
- Initialize Terraform: `terraform init`
- Apply bootstrap infrastructure: `terraform apply`

What gets created:
- S3 bucket for remote state
- DynamoDB table for state locks
- IAM role with policy scoped to state access
- (Optional) GitHub OIDC provider
⚠️ Only run bootstrap once per account. Do not run it from CI automatically.
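For reference, a minimal sketch of what those bootstrap resources could look like in Terraform (bucket and table names are placeholders):

```hcl
# Hypothetical one-time bootstrap resources for remote state.
resource "aws_s3_bucket" "state" {
  bucket = "example-tf-state-bucket"
}

# Versioning lets you recover a previous state file after a bad write.
resource "aws_s3_bucket_versioning" "state" {
  bucket = aws_s3_bucket.state.id
  versioning_configuration {
    status = "Enabled"
  }
}

# The S3 backend expects a DynamoDB lock table keyed on the string attribute "LockID".
resource "aws_dynamodb_table" "lock" {
  name         = "example-tf-locks"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID"

  attribute {
    name = "LockID"
    type = "S"
  }
}

# (Optional) The GitHub Actions OIDC provider and its role trust would also live here.
```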
- Reference the backend in your envs:

```hcl
terraform {
  backend "s3" {
    bucket         = "<state-bucket-name>"
    key            = "envs/dev/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "<lock-table-name>"
  }
}
```

- Define variables:
```text
variables.tf       # Contract
terraform.tfvars   # Environment-specific values
```
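For illustration, a hypothetical entry in each file (the variable name is an assumption, not the repo's actual contract):

```hcl
# variables.tf -- declares the contract
variable "model_bucket" {
  description = "S3 bucket holding immutable model artifacts"
  type        = string
}
```

```hcl
# terraform.tfvars -- supplies the environment-specific value
model_bucket = "example-dev-model-artifacts"
```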
Terraform does not auto-discover folders. You explicitly choose what to run:

```bash
cd terraform
terraform init
terraform plan
terraform apply
```

Each folder = explicit execution boundary.
- `SAGEMAKER_EXECUTION_ROLE`
- `MODEL_BUCKET`
- `SAGEMAKER_S3_CAPTURE_UPLOAD_PATH`
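A sketch of where these values land, assuming they map to Terraform variables with the same meaning (resource and variable names below are hypothetical):

```hcl
variable "sagemaker_execution_role" { type = string }  # SAGEMAKER_EXECUTION_ROLE
variable "model_bucket"             { type = string }  # MODEL_BUCKET
variable "sklearn_image_uri"        { type = string }  # assumption: container URI supplied per region

# Hypothetical: register a model version from an immutable S3 artifact.
resource "aws_sagemaker_model" "candidate" {
  name               = "demo-model-v2"
  execution_role_arn = var.sagemaker_execution_role

  primary_container {
    image          = var.sklearn_image_uri
    model_data_url = "s3://${var.model_bucket}/demo-model-v2/model.tar.gz"
  }
}

# SAGEMAKER_S3_CAPTURE_UPLOAD_PATH would feed the EndpointConfig's
# data-capture destination rather than the model itself.
```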
- Train model → upload artifact to S3
- Create SageMaker Model
- Create EndpointConfig
- Deploy Endpoint
- Canary via weighted variants
- Promote via weight shift
- Rollback via config swap
No retraining required for rollback.
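A minimal sketch of the canary mechanics in Terraform, assuming two already-registered SageMaker Models (all names below are placeholders):

```hcl
variable "current_model_name"   { type = string }  # known-good model
variable "candidate_model_name" { type = string }  # new model under canary

# Two weighted variants on one endpoint: the traffic split is the canary.
resource "aws_sagemaker_endpoint_configuration" "canary" {
  name = "demo-config-canary"

  production_variants {
    variant_name           = "current"
    model_name             = var.current_model_name
    initial_instance_count = 1
    instance_type          = "ml.t2.large"
    initial_variant_weight = 0.9   # 90% stays on the known-good model
  }

  production_variants {
    variant_name           = "candidate"
    model_name             = var.candidate_model_name
    initial_instance_count = 1
    instance_type          = "ml.t2.large"
    initial_variant_weight = 0.1   # 10% canary traffic
  }
}

# Promotion = shifting weight toward "candidate" (or applying a new config
# at weight 1.0). Rollback = pointing endpoint_config_name back at the
# previous config. Neither touches the model artifact.
resource "aws_sagemaker_endpoint" "this" {
  name                 = "demo-endpoint"
  endpoint_config_name = aws_sagemaker_endpoint_configuration.canary.name
}
```

Because weights live on the EndpointConfig, promotion and rollback are pure infrastructure changes; the artifacts in S3 stay immutable.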
| Scenario | Cause | Mitigation / Recovery |
|---|---|---|
| Terraform state lock held | Previous plan crashed | `terraform force-unlock <ID>` |
| Partial apply | Module error or timeout | Re-run `terraform apply` (idempotent) |
| Canary triggers alarm | Unexpected model behavior | Rollback: swap to the previous EndpointConfig |
| Endpoint left running overnight | Misoperation | Manual termination; investigate CI job schedule |
| Missing backend resources | Bootstrap not applied | Apply `bootstrap/` first |
Always isolate bootstrap vs env infra to minimize blast radius.
- Region: `us-east-1`
- Training: `ml.t3.large`
- Inference: `ml.t2.large`
- Endpoints must be deleted the same day
An endpoint existing overnight is considered a failure.
- ❌ AutoML
- ❌ Feature Store
- ❌ Streaming inference
- ❌ Online experimentation
- ❌ Data science workflows
This is deployment mechanics only.
You should be able to confidently:
- Deploy SageMaker endpoints via Terraform
- Explain canary vs rollback mechanics
- Reason about blast radius and IAM
- Defend the design in senior interviews
- Extend this into multi-account setups
🚧 In progress. Focused on correctness > completeness.