Minimal, cost-safe SageMaker MLOps reference for production-style deployments.
This repo is infrastructure-first: it documents how to deploy models safely, not how to experiment.
No notebooks. No Studio. No hand-clicking.
- Deterministic training
- Immutable model artifacts
- Terraform-controlled deployments
- Canary + rollback without retraining
- Minimal AWS blast radius
- Interview-defensible design
- Local VS Code + AWS CLI
- SageMaker built-in `sklearn` container
- S3 for model artifacts
- SageMaker:
  - Model
  - EndpointConfig
  - Endpoint
- Terraform as single control plane
- (Later) GitHub Actions for orchestration
```text
.
├── bootstrap/    # One-time infra: state bucket, DynamoDB lock table, IAM
├── data/         # Training / testing data
├── model/        # Model creation artifacts
├── terraform/    # MLOps infra: models, endpoints, configs
├── pipelines/    # Training / inference / deployment / API testing
└── README.md
```
`bootstrap/` is never run automatically. It exists to document how Terraform state is safely provisioned.
- Remote state stored in S3
- State locking via DynamoDB
- IAM scoped to:
  - S3 state bucket objects only
  - DynamoDB lock table only
- Optional GitHub Actions OIDC trust for CI
This avoids:
- Shared global state
- Cross-account contamination
- Accidental privilege escalation
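As an illustration of that scoping, here is a minimal Terraform sketch of a state-access policy. The bucket name, table name, and account ID are placeholders, not the repo's actual values:

```hcl
# Hypothetical state-scoped policy: S3 state objects + DynamoDB lock table only.
data "aws_iam_policy_document" "state_access" {
  statement {
    sid       = "StateBucketList"
    actions   = ["s3:ListBucket"]
    resources = ["arn:aws:s3:::example-tf-state-bucket"]
  }

  statement {
    sid       = "StateObjects"
    actions   = ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"]
    resources = ["arn:aws:s3:::example-tf-state-bucket/envs/*"]
  }

  statement {
    sid       = "StateLock"
    actions   = ["dynamodb:GetItem", "dynamodb:PutItem", "dynamodb:DeleteItem"]
    resources = ["arn:aws:dynamodb:us-east-1:123456789012:table/example-tf-locks"]
  }
}
```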
- Navigate to the bootstrap folder: `cd bootstrap`
- Initialize Terraform: `terraform init`
- Apply bootstrap infrastructure: `terraform apply`

What gets created:
- S3 bucket for remote state
- DynamoDB table for state locks
- IAM role with policy scoped to state access
- (Optional) GitHub OIDC provider
⚠️ Only run bootstrap once per account. Do not run it from CI automatically.
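For reference, a minimal sketch of what those bootstrap resources could look like in Terraform (bucket and table names are placeholders):

```hcl
# Hypothetical one-time bootstrap resources for remote state.
resource "aws_s3_bucket" "state" {
  bucket = "example-tf-state-bucket"
}

# Versioning lets you recover a previous state file after a bad write.
resource "aws_s3_bucket_versioning" "state" {
  bucket = aws_s3_bucket.state.id
  versioning_configuration {
    status = "Enabled"
  }
}

# The S3 backend expects a DynamoDB lock table keyed on the string attribute "LockID".
resource "aws_dynamodb_table" "lock" {
  name         = "example-tf-locks"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID"

  attribute {
    name = "LockID"
    type = "S"
  }
}

# (Optional) The GitHub Actions OIDC provider and its role trust would also live here.
```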
- Reference the backend in your envs:

```hcl
terraform {
  backend "s3" {
    bucket         = "<state-bucket-name>"
    key            = "envs/dev/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "<lock-table-name>"
  }
}
```

- Define variables:
```text
variables.tf       # Contract
terraform.tfvars   # Environment-specific values
```
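For illustration, a hypothetical entry in each file (the variable name is an assumption, not the repo's actual contract):

```hcl
# variables.tf -- declares the contract
variable "model_bucket" {
  description = "S3 bucket holding immutable model artifacts"
  type        = string
}
```

```hcl
# terraform.tfvars -- supplies the environment-specific value
model_bucket = "example-dev-model-artifacts"
```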
Terraform does not auto-discover folders. You explicitly choose what to run:

```bash
cd terraform
terraform init
terraform plan
terraform apply
```

Each folder = explicit execution boundary.
- `SAGEMAKER_EXECUTION_ROLE`
- `MODEL_BUCKET`
- `SAGEMAKER_S3_CAPTURE_UPLOAD_PATH`
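A sketch of where these values land, assuming they map to Terraform variables with the same meaning (resource and variable names below are hypothetical):

```hcl
variable "sagemaker_execution_role" { type = string }  # SAGEMAKER_EXECUTION_ROLE
variable "model_bucket"             { type = string }  # MODEL_BUCKET
variable "sklearn_image_uri"        { type = string }  # assumption: container URI supplied per region

# Hypothetical: register a model version from an immutable S3 artifact.
resource "aws_sagemaker_model" "candidate" {
  name               = "demo-model-v2"
  execution_role_arn = var.sagemaker_execution_role

  primary_container {
    image          = var.sklearn_image_uri
    model_data_url = "s3://${var.model_bucket}/demo-model-v2/model.tar.gz"
  }
}

# SAGEMAKER_S3_CAPTURE_UPLOAD_PATH would feed the EndpointConfig's
# data-capture destination rather than the model itself.
```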
- Train model → upload artifact to S3
- Create SageMaker Model
- Create EndpointConfig
- Deploy Endpoint
- Canary via weighted variants
- Promote via weight shift
- Rollback via config swap
No retraining required for rollback.
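A minimal sketch of the canary mechanics in Terraform, assuming two already-registered SageMaker Models (all names below are placeholders):

```hcl
variable "current_model_name"   { type = string }  # known-good model
variable "candidate_model_name" { type = string }  # new model under canary

# Two weighted variants on one endpoint: the traffic split is the canary.
resource "aws_sagemaker_endpoint_configuration" "canary" {
  name = "demo-config-canary"

  production_variants {
    variant_name           = "current"
    model_name             = var.current_model_name
    initial_instance_count = 1
    instance_type          = "ml.t2.large"
    initial_variant_weight = 0.9   # 90% stays on the known-good model
  }

  production_variants {
    variant_name           = "candidate"
    model_name             = var.candidate_model_name
    initial_instance_count = 1
    instance_type          = "ml.t2.large"
    initial_variant_weight = 0.1   # 10% canary traffic
  }
}

# Promotion = shifting weight toward "candidate" (or applying a new config
# at weight 1.0). Rollback = pointing endpoint_config_name back at the
# previous config. Neither touches the model artifact.
resource "aws_sagemaker_endpoint" "this" {
  name                 = "demo-endpoint"
  endpoint_config_name = aws_sagemaker_endpoint_configuration.canary.name
}
```

Because weights live on the EndpointConfig, promotion and rollback are pure infrastructure changes; the artifacts in S3 stay immutable.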
| Scenario | Cause | Mitigation / Recovery |
|---|---|---|
| Terraform state lock held | Previous plan crashed | `terraform force-unlock <ID>` |
| Partial apply | Module error or timeout | Re-run `terraform apply` (idempotent) |
| Canary triggers alarm | Unexpected model behavior | Rollback: swap to the previous EndpointConfig |
| Endpoint left running overnight | Misoperation | Manual termination; investigate CI job schedule |
| Missing backend resources | Bootstrap not applied | Apply `bootstrap/` first |
Always isolate bootstrap vs env infra to minimize blast radius.
- Region: `us-east-1`
- Training: `ml.t3.large`
- Inference: `ml.t2.large`
- Endpoints must be deleted the same day
An endpoint existing overnight is considered a failure.
- ❌ AutoML
- ❌ Feature Store
- ❌ Streaming inference
- ❌ Online experimentation
- ❌ Data science workflows
This is deployment mechanics only.
You should be able to confidently:
- Deploy SageMaker endpoints via Terraform
- Explain canary vs rollback mechanics
- Reason about blast radius and IAM
- Defend the design in senior interviews
- Extend this into multi-account setups
🚧 In progress. Focused on correctness > completeness.