
thisisbhanuj/sagemaker-mlops-minimal


SageMaker MLOps Minimal

Minimal, cost-safe SageMaker MLOps reference for production-style deployments.

This repo is infrastructure-first, documenting how to deploy models safely, not how to experiment.

No notebooks. No Studio. No hand-clicking.


Goals (Non-Negotiable)

  • Deterministic training
  • Immutable model artifacts
  • Terraform-controlled deployments
  • Canary + rollback without retraining
  • Minimal AWS blast radius
  • Interview-defensible design
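
One way to make "immutable model artifacts" concrete is to derive the S3 key from the artifact's content hash, so a key can never point at different bytes. The sketch below is an illustration under that assumption, not necessarily how this repo names artifacts; `artifact_key` and the `models/` prefix are hypothetical.

```python
import hashlib
from pathlib import Path


def artifact_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def artifact_key(path: Path, prefix: str = "models") -> str:
    """Build a content-addressed S3 key (hypothetical layout: prefix/digest/filename)."""
    return f"{prefix}/{artifact_digest(path)}/{path.name}"
```

With this layout, retraining that produces different bytes yields a new key, and rollback only needs to point a SageMaker Model back at an older key.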

High-Level Architecture

  • Local VS Code + AWS CLI

  • SageMaker built-in sklearn container

  • S3 for model artifacts

  • SageMaker:

    • Model
    • EndpointConfig
    • Endpoint
  • Terraform as single control plane

  • (Later) GitHub Actions for orchestration


Repository Structure (Conceptual)

.
├── bootstrap/        # One-time infra: state bucket, DynamoDB lock table, IAM
├── data/             # Training / Testing data
├── model/            # Model Creation Artifacts
├── terraform/        # MLOps infra: models, endpoints, configs
├── pipelines/        # Training / Inference / Deployment / API Testing
└── README.md

bootstrap/ is never run automatically. It exists to document how Terraform state is safely provisioned.


Terraform State Model

  • Remote state stored in S3

  • State locking via DynamoDB

  • IAM scoped to:

    • S3 state bucket objects only
    • DynamoDB lock table only
  • Optional GitHub Actions OIDC trust for CI

This avoids:

  • Shared global state
  • Cross-account contamination
  • Accidental privilege escalation
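
As a rough illustration of the scoping described above, the sketch below builds an IAM policy document limited to the state objects and the lock table. The function name, `key_prefix`, and the exact action list are assumptions. The actions follow what the Terraform S3 backend is generally documented to need, including `s3:ListBucket` on the bucket itself, so check them against your Terraform version before use.

```python
import json


def state_access_policy(bucket: str, table_arn: str, key_prefix: str = "envs/") -> dict:
    """Return an IAM policy document scoped to Terraform state and its lock table."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Listing is granted on the bucket, not on individual objects.
                "Sid": "ListStateBucket",
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": f"arn:aws:s3:::{bucket}",
            },
            {
                # Read/write only under the state key prefix (assumed "envs/").
                "Sid": "ReadWriteStateObjects",
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject"],
                "Resource": f"arn:aws:s3:::{bucket}/{key_prefix}*",
            },
            {
                # Lock acquisition and release on the single lock table.
                "Sid": "StateLocking",
                "Effect": "Allow",
                "Action": [
                    "dynamodb:DescribeTable",
                    "dynamodb:GetItem",
                    "dynamodb:PutItem",
                    "dynamodb:DeleteItem",
                ],
                "Resource": table_arn,
            },
        ],
    }


if __name__ == "__main__":
    # Placeholder names; substitute the bucket and table created by bootstrap/.
    arn = "arn:aws:dynamodb:us-east-1:123456789012:table/<lock-table-name>"
    print(json.dumps(state_access_policy("<state-bucket-name>", arn), indent=2))
```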

First-Time Setup (Bootstrap)

  1. Navigate to bootstrap folder
cd bootstrap
  2. Initialize Terraform
terraform init
  3. Apply bootstrap infrastructure
terraform apply

What gets created:

  • S3 bucket for remote state
  • DynamoDB table for state locks
  • IAM role with policy scoped to state access
  • (Optionally) GitHub OIDC provider

⚠️ Only run bootstrap once per account. Do not run it from CI automatically.

  4. Reference backend in envs
terraform {
  backend "s3" {
    bucket         = "<state-bucket-name>"
    key            = "envs/dev/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "<lock-table-name>"
  }
}
  5. Define variables
variables.tf         # Contract
terraform.tfvars     # Environment-specific values

Terraform Execution Model

Terraform does not auto-discover folders. You explicitly choose what to run:

cd terraform
terraform init
terraform plan
terraform apply

Each folder = explicit execution boundary.
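
If you script these runs, the same boundary can be kept explicit by always passing the folder to Terraform's global `-chdir` option. The helper below is a hypothetical sketch that only builds the command lists; it does not execute anything.

```python
from typing import List


def terraform_commands(folder: str) -> List[List[str]]:
    """Build init/plan/apply commands pinned to one folder via -chdir."""
    steps = ("init", "plan", "apply")
    return [["terraform", f"-chdir={folder}", step] for step in steps]


if __name__ == "__main__":
    # Print the commands for review; run them yourself (e.g. with subprocess.run).
    for command in terraform_commands("terraform"):
        print(" ".join(command))
```

Building the commands separately from running them keeps the choice of folder visible in code review.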


Setup Env Variables

  1. SAGEMAKER_EXECUTION_ROLE
  2. MODEL_BUCKET
  3. SAGEMAKER_S3_CAPTURE_UPLOAD_PATH
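
A pipeline script can fail fast when any of these variables is unset. The sketch below assumes all three are required; `load_settings` is a hypothetical helper, not a function from this repo.

```python
import os
from typing import Dict, Mapping

REQUIRED_VARS = (
    "SAGEMAKER_EXECUTION_ROLE",
    "MODEL_BUCKET",
    "SAGEMAKER_S3_CAPTURE_UPLOAD_PATH",
)


def load_settings(environ: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Return the required settings, or raise if any is missing or empty."""
    missing = [name for name in REQUIRED_VARS if not environ.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: environ[name] for name in REQUIRED_VARS}
```

Calling `load_settings()` at the top of a training or deployment script surfaces a missing variable before any AWS call is made.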

Deployment Model

  1. Train model → upload artifact to S3
  2. Create SageMaker Model
  3. Create EndpointConfig
  4. Deploy Endpoint
  5. Canary via weighted variants
  6. Promote via weight shift
  7. Rollback via config swap

No retraining required for rollback.
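
To make steps 5 to 7 concrete, the sketch below builds request payloads shaped like the SageMaker API inputs that boto3 exposes as `create_endpoint_config` (`ProductionVariants`), `update_endpoint_weights_and_capacities` (`DesiredWeightsAndCapacities`), and `update_endpoint`. This repo drives these resources through Terraform, so treat it as an illustration of the mechanics only. The variant names `stable` and `canary`, the single-instance count, and the helper names are assumptions.

```python
from typing import Dict, List


def canary_variants(stable_model: str, canary_model: str, canary_percent: int = 10,
                    instance_type: str = "ml.t2.large") -> List[dict]:
    """Two weighted variants; traffic share is weight divided by the sum of weights."""
    if not 0 <= canary_percent <= 100:
        raise ValueError("canary_percent must be between 0 and 100")

    def variant(name: str, model: str, weight: float) -> dict:
        return {
            "VariantName": name,
            "ModelName": model,
            "InitialInstanceCount": 1,  # assumption: one instance per variant
            "InstanceType": instance_type,
            "InitialVariantWeight": weight,
        }

    return [
        variant("stable", stable_model, float(100 - canary_percent)),
        variant("canary", canary_model, float(canary_percent)),
    ]


def traffic_share(variants: List[dict]) -> Dict[str, float]:
    """Fraction of requests each variant receives for the given weights."""
    total = sum(v["InitialVariantWeight"] for v in variants)
    return {v["VariantName"]: v["InitialVariantWeight"] / total for v in variants}


def promotion_weights(canary_percent: int) -> List[dict]:
    """Weight shift for promotion: raise the canary share on the live endpoint."""
    return [
        {"VariantName": "stable", "DesiredWeight": float(100 - canary_percent)},
        {"VariantName": "canary", "DesiredWeight": float(canary_percent)},
    ]


def rollback_request(endpoint_name: str, previous_config: str) -> dict:
    """Config swap for rollback: point the endpoint at the previous EndpointConfig."""
    return {"EndpointName": endpoint_name, "EndpointConfigName": previous_config}
```

Rollback touches only the endpoint's configuration reference, which is why no model artifact is rebuilt or retrained.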


Failure Scenarios (and Mitigation)

| Scenario | Cause | Mitigation / Recovery |
| --- | --- | --- |
| Terraform state lock held | Previous plan crashed | terraform force-unlock <ID> |
| Partial apply | Module error or timeout | Re-run terraform apply (idempotent) |
| Canary triggers alarm | Unexpected model behavior | Rollback: swap to previous EndpointConfig |
| Endpoint left running overnight | Misoperation | Manual termination, investigate CI job schedule |
| Missing backend resources | Bootstrap not applied | Apply bootstrap/ first |

Always keep bootstrap infra isolated from env infra to minimize blast radius.


Guardrails (Cost + Safety)

  • Region: us-east-1
  • Training: ml.t3.large
  • Inference: ml.t2.large
  • Endpoints must be deleted same day

An endpoint existing overnight is considered a failure.
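
The overnight rule can be checked mechanically. The sketch below filters records shaped like the `Endpoints` entries returned by boto3's SageMaker `list_endpoints` call (`EndpointName`, timezone-aware `CreationTime`). Treating "same day" as the same UTC calendar date is an assumption, and `overnight_endpoints` is a hypothetical helper.

```python
from datetime import date, datetime, timezone
from typing import Iterable, List, Mapping, Optional


def overnight_endpoints(endpoints: Iterable[Mapping], today: Optional[date] = None) -> List[str]:
    """Return names of endpoints created before today's UTC date."""
    current = today or datetime.now(timezone.utc).date()
    return [
        ep["EndpointName"]
        for ep in endpoints
        if ep["CreationTime"].astimezone(timezone.utc).date() < current
    ]
```

Any name this returns is an endpoint that has outlived its day and should be deleted and investigated.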


What This Repo Is NOT

  • ❌ AutoML
  • ❌ Feature Store
  • ❌ Streaming inference
  • ❌ Online experimentation
  • ❌ Data science workflows

This is deployment mechanics only.


Target Outcome

You should be able to confidently:

  • Deploy SageMaker endpoints via Terraform
  • Explain canary vs rollback mechanics
  • Reason about blast radius and IAM
  • Defend the design in senior interviews
  • Extend this into multi-account setups

Status

🚧 In progress. Focused on correctness over completeness.

