ModelOps Developer Guide

This guide consolidates developer documentation for ModelOps. For architecture details, see docs/architecture/.

Table of Contents

  • Testing
  • Image Management
  • Common Issues & Fixes
  • Debugging Commands
  • Building & Deploying
  • Additional Resources

Testing

Running Tests

# Unit tests (default, fast ~10-20s)
make test
# or
uv run pytest

# Integration tests (creates LocalCluster instances)
make test-integration

# Run specific test file
uv run pytest tests/test_component_dependencies.py

# Run specific test function
uv run pytest tests/test_dask_serialization.py::test_cloudpickle_simtask

# Run with coverage
uv run pytest --cov=modelops --cov-report=html

Using External Dask for Debugging

By default, integration tests create their own LocalCluster. To use an external cluster:

# Start external Dask cluster
make dask-local

# Use external cluster (must explicitly opt-in)
DASK_ADDRESS=tcp://localhost:8786 make test-integration
# or
make test-integration-external  # uses --dask-address flag

# Stop when done
make dask-stop

CI Behavior

  • Resource Scaling: CI uses 1 worker with 1GB memory (vs 2 workers with 2GB locally)
  • Timeouts: 60 seconds per test, 10 minutes overall
  • Auto-skip: tests skip gracefully when resources are constrained (see the sketch below)
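
For illustration, this is roughly what the resource scaling could look like in a conftest.py fixture. Detecting CI via the standard CI environment variable and the fixture name are assumptions here, not the actual implementation:

# Hypothetical conftest.py excerpt: scale the LocalCluster down on CI.
# Assumes the CI env var is set to "true" on CI runners.
import os
import pytest
from distributed import LocalCluster

ON_CI = os.environ.get("CI") == "true"

@pytest.fixture(scope="session")
def local_cluster():
    cluster = LocalCluster(
        n_workers=1 if ON_CI else 2,
        memory_limit="1GB" if ON_CI else "2GB",
        dashboard_address=None,  # avoid port clashes on shared CI runners
    )
    yield cluster
    cluster.close()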

Image Management

Single Source of Truth

All Docker image references are centralized in modelops-images.yaml:

profiles:
  prod:
    registry: {host: ghcr.io, org: institutefordiseasemodeling}
    default_tag: latest
  dev:
    registry: {host: ghcr.io, org: institutefordiseasemodeling}
    default_tag: dev

images:
  scheduler: {name: modelops-dask-scheduler}
  worker: {name: modelops-dask-worker}
  runner: {name: modelops-dask-runner}

Using Image Configuration

# CLI access to image config
mops dev images print scheduler     # Single image
mops dev images print --all         # All images
mops dev images export-env          # Export as env vars

# In Python code
from modelops.images import get_image_config
config = get_image_config()
worker_image = config.worker_image()  # ghcr.io/institutefordiseasemodeling/modelops-dask-worker:latest

# In Makefile
WORKER_IMAGE := $(shell uv run mops dev images print worker)

Digest-Based Deployment (Preventing Cache Issues)

The :latest tag is mutable and heavily cached by Kubernetes. Use digests for reliable deployments:

# Build and capture digest
make build-worker
# Stores digest in .build/worker.digest

# Deploy by digest (not tag); resolve the image name from the config
WORKER_IMAGE=$(uv run mops dev images print worker)
kubectl set image deployment/dask-workers \
  worker=${WORKER_IMAGE}@$(cat .build/worker.digest) \
  -n modelops-dask-dev

# Verify deployment
kubectl get pods -l app=dask-worker -o jsonpath='{.items[0].status.containerStatuses[0].imageID}'
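
To make that verification repeatable, a short script can compare the digest recorded at build time against what a running pod reports. A minimal sketch, assuming .build/worker.digest contains a bare sha256:... digest and kubectl is pointed at the right cluster:

# Hedged sketch: compare the built digest with the one a pod is running.
import subprocess
from pathlib import Path

def running_digest(namespace="modelops-dask-dev", selector="app=dask-worker"):
    image_id = subprocess.run(
        ["kubectl", "get", "pods", "-n", namespace, "-l", selector,
         "-o", "jsonpath={.items[0].status.containerStatuses[0].imageID}"],
        capture_output=True, text=True, check=True,
    ).stdout
    # imageID looks like ghcr.io/org/image@sha256:abc...; keep the digest part
    return image_id.rsplit("@", 1)[-1]

built = Path(".build/worker.digest").read_text().strip()
running = running_digest()
print("up to date" if built == running else f"STALE: built {built}, running {running}")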

Common Issues & Fixes

Pulumi Passphrase Errors

Error: "incorrect passphrase" when accessing Pulumi stacks

Root Cause: PULUMI_CONFIG_PASSPHRASE_FILE not passed to subprocess

Fix: Ensure env_vars=dict(os.environ) in src/modelops/core/automation.py:workspace_options()

# CRITICAL: Pass the full environment to the Pulumi subprocess
import os
from pulumi import automation as auto

def workspace_options() -> auto.LocalWorkspaceOptions:
    return auto.LocalWorkspaceOptions(
        env_vars=dict(os.environ)  # includes PULUMI_CONFIG_PASSPHRASE_FILE
    )

Bundle Registry Authentication

Error: "Expecting value: line 1 column 1 (char 0)" when fetching bundles

Root Cause: ACR returning HTML login page instead of JSON

Common Causes:

  1. Repository name mismatch (e.g., pushing to smoke_bundle, pulling from modelops-bundles)
  2. Bundle reference format inconsistency (need repository@sha256:digest)
  3. Wrong registry URL in environment

Fix: Ensure consistent repository naming and format:

# Correct format
bundle_ref = "smoke_bundle@sha256:abc123..."
MODELOPS_BUNDLE_REGISTRY = "modelopsdevacrvsb.azurecr.io"  # No repository path
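
When in doubt, a quick probe of the registry's /v2/ endpoint shows whether you are getting JSON (good) or an HTML login page (the failure above). A minimal standard-library sketch; the registry host is just the example value from above:

# Probe an OCI registry endpoint and report whether it answers with JSON.
# An HTML body usually means requests are being redirected to a login page.
import json
import urllib.request
import urllib.error

REGISTRY = "modelopsdevacrvsb.azurecr.io"  # example host from above

def probe(url):
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            body = resp.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as e:
        # A 401 with a WWW-Authenticate header is normal for ACR
        print(f"{url} -> HTTP {e.code}, WWW-Authenticate: {e.headers.get('WWW-Authenticate')}")
        return
    try:
        json.loads(body)
        print(f"{url} -> JSON response (registry API reachable)")
    except json.JSONDecodeError:
        print(f"{url} -> non-JSON response (likely an HTML login page)")

probe(f"https://{REGISTRY}/v2/")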

Kubernetes Using Stale Images

Symptom: Fixes aren't working despite make deploy

Root Cause: Kubernetes caches :latest tags aggressively

Quick Fix:

# Force delete pods to pull fresh images
kubectl delete pods -n modelops-dask-dev -l app=dask-worker --force --grace-period=0

# Verify new code is running
kubectl exec deployment/dask-workers -n modelops-dask-dev -- \
  grep -A3 "your_function" /path/to/file.py

Better Fix: Use digest-based deployment (see above)

Dask Fixture Timeouts

Error: Integration tests hang for 30+ seconds

Root Cause: Tests trying to connect to external Dask before creating LocalCluster

Fix: Default to LocalCluster (already fixed in conftest.py):

# Tests now create LocalCluster by default
# Must explicitly opt-in to external with --dask-address or DASK_ADDRESS
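
A minimal sketch of that opt-in pattern; the option name matches the flag above, but the fixture body is an assumption about how conftest.py is wired, not a copy of it:

# Hypothetical conftest.py excerpt: connect to an external scheduler only
# when explicitly requested, otherwise spin up a throwaway LocalCluster.
import os
import pytest
from distributed import Client, LocalCluster

def pytest_addoption(parser):
    parser.addoption("--dask-address", default=os.environ.get("DASK_ADDRESS"))

@pytest.fixture
def dask_client(request):
    address = request.config.getoption("--dask-address")
    if address:  # explicit opt-in, e.g. tcp://localhost:8786
        client = Client(address)
        yield client
        client.close()
    else:  # default: local, in-process cluster; no network wait, no hang
        with LocalCluster(n_workers=2, processes=False) as cluster:
            with Client(cluster) as client:
                yield client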

Debugging Commands

Check Pod Status and Logs

# List pods
kubectl get pods -n modelops-dask-dev

# Check pod details
kubectl describe pod <pod-name> -n modelops-dask-dev

# View logs
kubectl logs -n modelops-dask-dev -l app=dask-scheduler
kubectl logs -n modelops-dask-dev -l app=dask-worker --tail=50

# Follow logs
kubectl logs -f deployment/dask-workers -n modelops-dask-dev

Port Forwarding

# Dask scheduler
kubectl port-forward -n modelops-dask-dev svc/dask-scheduler 8786:8786

# Dask dashboard
kubectl port-forward -n modelops-dask-dev svc/dask-scheduler 8787:8787

# Multiple ports
kubectl port-forward -n modelops-dask-dev svc/dask-scheduler 8786:8786 8787:8787

Verify Deployments

# Check what image a pod is actually running
kubectl get pod <pod-name> -o jsonpath='{.status.containerStatuses[0].imageID}'

# Check environment variables
kubectl exec -it deployment/dask-workers -n modelops-dask-dev -- env | grep MODELOPS

# Check actual code in pod
kubectl exec deployment/dask-workers -n modelops-dask-dev -- \
  cat /usr/local/lib/python3.13/site-packages/modelops/__version__.py

# Force rollout restart
kubectl rollout restart deployment/dask-workers -n modelops-dask-dev
kubectl rollout status deployment/dask-workers -n modelops-dask-dev

Pulumi State Inspection

# Check stack outputs
pulumi stack output --stack modelops-infra-dev

# List all stacks
pulumi stack ls

# Check specific output
pulumi stack output kubeconfig --stack modelops-infra-dev

# Show full stack state (verbose)
pulumi stack export --stack modelops-infra-dev | jq .

Building & Deploying

GitHub Actions Workflow

Images are automatically built on push to main:

  • Triggered by .github/workflows/docker-build.yml
  • Pushes to ghcr.io/institutefordiseasemodeling/
  • Tagged with commit SHA and latest

Dependency Installation (Calabaria and Other Packages)

External dependencies like modelops-calabaria and modelops-bundle are installed at Docker image build time via pip from GitHub repositories.

How it works:

In docker/Dockerfile.runner (lines 40-44):

# Install modelops-bundle (needed for bundle management)
RUN pip install --no-cache-dir git+https://${GITHUB_TOKEN}@github.com/institutefordiseasemodeling/modelops-bundle.git

# Install modelops-calabaria for calibration support
RUN pip install --no-cache-dir git+https://${GITHUB_TOKEN}@github.com/institutefordiseasemodeling/modelops-calabaria.git

These lines pull the latest code from the main branch of each repository at build time.

Deployment Workflow for Fixes:

When you make a fix to calabaria (or any other installed dependency):

  1. Commit and push to calabaria repo:

    cd modelops-calabaria
    git add src/modelops_calabaria/calibration/wire.py
    git commit -m "fix: handle dict results in convert_to_trial_result"
    git push origin main
  2. Trigger image rebuild: The fix won't be in running pods until images are rebuilt with the updated dependency.

    Option A: Automatic rebuild (CI/CD)

    • Push any commit to the modelops repo (even a trivial change)
    • GitHub Actions will trigger and rebuild all images
    • Images are pushed to GHCR with the new calabaria code included

    Option B: Manual rebuild

    cd modelops
    make build-runner  # Rebuilds runner image with latest calabaria from GitHub
    docker push ghcr.io/institutefordiseasemodeling/modelops-dask-runner:latest
  3. Restart Kubernetes pods to pull new images:

    kubectl rollout restart deployment/dask-runner -n modelops-dask-dev
    kubectl rollout status deployment/dask-runner -n modelops-dask-dev

Why Image Rebuilds Are Required:

Unlike code changes to modelops itself (which are in the COPY layer), calabaria is installed via pip install git+https://.... This means:

  • The calabaria code is baked into the image at build time
  • Simply restarting pods won't pick up calabaria fixes
  • You must rebuild the image to get the latest code from GitHub
  • Kubernetes image pull policies may cache :latest tags aggressively (use digests for reliability)

Quick Fix Verification:

After deploying, verify the fix is actually running:

# Check that the new code is present in the pod
kubectl exec deployment/dask-runner -n modelops-dask-dev -- \
  grep -A5 "isinstance(result, dict)" \
  /usr/local/lib/python3.12/site-packages/modelops_calabaria/calibration/wire.py

# Or check the installed package version/commit
kubectl exec deployment/dask-runner -n modelops-dask-dev -- \
  pip show modelops-calabaria

Local Development Build

# Build all images
make build  # Builds scheduler, worker, runner

# Build specific image
make build-worker
make build-scheduler
make build-runner

# Push to registry (after building)
make push

# Pull latest from registry
make pull-latest

# Full deployment cycle
make build push deploy verify-deploy

Deployment Verification

Always verify deployments actually worked:

# Custom verification command
make verify-deploy

# Manual verification
kubectl get pods -n modelops-dask-dev
kubectl logs -n modelops-dask-dev -l app=dask-worker --tail=10

# Run smoke test
mops dev smoke-test

Additional Resources

Tips

  1. Always verify deployments - Don't trust that make deploy worked
  2. Use digests for production - Tags are mutable and cached
  3. Check environment variables - Many issues are missing env vars
  4. Force delete pods when in doubt - Kubernetes caching is aggressive
  5. Review the image config - Single source of truth in modelops-images.yaml