This guide consolidates developer documentation for ModelOps. For architecture details, see docs/architecture/.
```bash
# Unit tests (default, fast ~10-20s)
make test
# or
uv run pytest

# Integration tests (creates LocalCluster instances)
make test-integration

# Run specific test file
uv run pytest tests/test_component_dependencies.py

# Run specific test function
uv run pytest tests/test_dask_serialization.py::test_cloudpickle_simtask

# Run with coverage
uv run pytest --cov=modelops --cov-report=html
```

By default, integration tests create their own LocalCluster. To use an external cluster:
```bash
# Start external Dask cluster
make dask-local

# Use external cluster (must explicitly opt in)
DASK_ADDRESS=tcp://localhost:8786 make test-integration
# or
make test-integration-external  # uses --dask-address flag

# Stop when done
make dask-stop
```

- Resource Scaling: CI uses 1 worker with 1 GB memory (vs. 2 workers, 2 GB locally)
- Timeouts: 60 seconds per test, 10 minutes overall
- Auto-skip: tests skip gracefully when resources are constrained
All Docker image references are centralized in `modelops-images.yaml`:

```yaml
profiles:
  prod:
    registry: {host: ghcr.io, org: institutefordiseasemodeling}
    default_tag: latest
  dev:
    registry: {host: ghcr.io, org: institutefordiseasemodeling}
    default_tag: dev

images:
  scheduler: {name: modelops-dask-scheduler}
  worker: {name: modelops-dask-worker}
  runner: {name: modelops-dask-runner}
```

```bash
# CLI access to image config
mops dev images print scheduler  # Single image
mops dev images print --all      # All images
mops dev images export-env       # Export as env vars
```
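For illustration, here is how a full image reference could be composed from the profile and image entries above. This is a hypothetical sketch; the real resolution logic lives in the `mops` CLI and `modelops.images`:

```python
# Mirrors the structure of modelops-images.yaml (inlined as a dict for brevity)
CONFIG = {
    "profiles": {
        "prod": {"registry": {"host": "ghcr.io", "org": "institutefordiseasemodeling"},
                 "default_tag": "latest"},
        "dev": {"registry": {"host": "ghcr.io", "org": "institutefordiseasemodeling"},
                "default_tag": "dev"},
    },
    "images": {
        "scheduler": {"name": "modelops-dask-scheduler"},
        "worker": {"name": "modelops-dask-worker"},
        "runner": {"name": "modelops-dask-runner"},
    },
}

def image_ref(image: str, profile: str = "prod") -> str:
    """Compose host/org/name:tag from the config entries."""
    prof = CONFIG["profiles"][profile]
    reg = prof["registry"]
    name = CONFIG["images"][image]["name"]
    return f"{reg['host']}/{reg['org']}/{name}:{prof['default_tag']}"
```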
```python
# In Python code
from modelops.images import get_image_config

config = get_image_config()
worker_image = config.worker_image()  # ghcr.io/institutefordiseasemodeling/modelops-dask-worker:latest
```

```makefile
# In Makefile
WORKER_IMAGE := $(shell uv run mops dev images print worker)
```

The `:latest` tag is mutable and heavily cached by Kubernetes. Use digests for reliable deployments:
```bash
# Build and capture digest
make build-worker
# Stores digest in .build/worker.digest

# Deploy by digest (not tag)
kubectl set image deployment/dask-workers \
  worker=$(WORKER_IMAGE)@$(cat .build/worker.digest) \
  -n modelops-dask-dev

# Verify deployment
kubectl get pods -l app=dask-worker -o jsonpath='{.items[0].status.containerStatuses[0].imageID}'
```

Error: "incorrect passphrase" when accessing Pulumi stacks

Root Cause: `PULUMI_CONFIG_PASSPHRASE_FILE` not passed to the subprocess

Fix: Ensure `env_vars=dict(os.environ)` in `src/modelops/core/automation.py:workspace_options()`:
```python
# CRITICAL: Pass full environment to subprocess
return auto.LocalWorkspaceOptions(
    env_vars=dict(os.environ)  # Must pass environment
)
```

Error: "Expecting value: line 1 column 1 (char 0)" when fetching bundles

Root Cause: ACR returning an HTML login page instead of JSON

Common Causes:
- Repository name mismatch (e.g., pushing to `smoke_bundle`, pulling from `modelops-bundles`)
- Bundle reference format inconsistency (need `repository@sha256:digest`)
- Wrong registry URL in the environment

Fix: Ensure consistent repository naming and format:

```python
# Correct format
bundle_ref = "smoke_bundle@sha256:abc123..."
MODELOPS_BUNDLE_REGISTRY = "modelopsdevacrvsb.azurecr.io"  # No repository path
```

Symptom: Fixes aren't working despite `make deploy`
Root Cause: Kubernetes caches `:latest` tags aggressively

Quick Fix:

```bash
# Force delete pods to pull fresh images
kubectl delete pods -n modelops-dask-dev -l app=dask-worker --force --grace-period=0

# Verify new code is running
kubectl exec deployment/dask-workers -n modelops-dask-dev -- \
  grep -A3 "your_function" /path/to/file.py
```

Better Fix: Use digest-based deployment (see above)
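When scripting digest-based deploys, the reference takes the form `repository@sha256:...` rather than `repository:tag`. A minimal helper for composing it, shown only as an illustration (this function is not part of the modelops API):

```python
def digest_ref(image: str, digest: str) -> str:
    """Pin an image reference by digest, dropping any mutable :tag suffix."""
    repo, sep, tag = image.rpartition(":")
    # A ":" in the final path component is a tag; a ":" earlier is a registry port
    if sep and "/" not in tag:
        image = repo
    return f"{image}@{digest}"
```

Feeding this the contents of `.build/worker.digest` yields a reference that Kubernetes cannot serve from a stale tag cache.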
Error: Integration tests hang for 30+ seconds

Root Cause: Tests trying to connect to an external Dask cluster before creating a LocalCluster

Fix: Default to LocalCluster (already fixed in `conftest.py`):

```python
# Tests now create LocalCluster by default
# Must explicitly opt in to external with --dask-address or DASK_ADDRESS
```

```bash
# List pods
kubectl get pods -n modelops-dask-dev

# Check pod details
kubectl describe pod <pod-name> -n modelops-dask-dev

# View logs
kubectl logs -n modelops-dask-dev -l app=dask-scheduler
kubectl logs -n modelops-dask-dev -l app=dask-worker --tail=50

# Follow logs
kubectl logs -f deployment/dask-workers -n modelops-dask-dev
```

```bash
# Dask scheduler
kubectl port-forward -n modelops-dask-dev svc/dask-scheduler 8786:8786

# Dask dashboard
kubectl port-forward -n modelops-dask-dev svc/dask-scheduler 8787:8787

# Multiple ports
kubectl port-forward -n modelops-dask-dev svc/dask-scheduler 8786:8786 8787:8787
```

```bash
# Check what image a pod is actually running
kubectl get pod <pod-name> -o jsonpath='{.status.containerStatuses[0].imageID}'

# Check environment variables
kubectl exec -it deployment/dask-workers -n modelops-dask-dev -- env | grep MODELOPS

# Check actual code in pod
kubectl exec deployment/dask-workers -n modelops-dask-dev -- \
  cat /usr/local/lib/python3.13/site-packages/modelops/__version__.py

# Force rollout restart
kubectl rollout restart deployment/dask-workers -n modelops-dask-dev
kubectl rollout status deployment/dask-workers -n modelops-dask-dev
```

```bash
# Check stack outputs
pulumi stack output --stack modelops-infra-dev

# List all stacks
pulumi stack ls

# Check specific output
pulumi stack output kubeconfig --stack modelops-infra-dev

# Show full stack state (verbose)
pulumi stack export --stack modelops-infra-dev | jq .
```

Images are automatically built on push to main:
- Triggered by `.github/workflows/docker-build.yml`
- Pushes to `ghcr.io/institutefordiseasemodeling/`
- Tagged with commit SHA and `latest`
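A build-and-push step in such a workflow typically looks something like the following. This is an illustrative sketch using `docker/build-push-action`; see `.github/workflows/docker-build.yml` for the real workflow:

```yaml
# Illustrative only — the actual workflow may differ
- name: Build and push worker image
  uses: docker/build-push-action@v6
  with:
    file: docker/Dockerfile.worker
    push: true
    tags: |
      ghcr.io/institutefordiseasemodeling/modelops-dask-worker:latest
      ghcr.io/institutefordiseasemodeling/modelops-dask-worker:${{ github.sha }}
```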
External dependencies like modelops-calabaria and modelops-bundle are installed at Docker image build time via pip from GitHub repositories.
How it works:
In docker/Dockerfile.runner (lines 40-44):
```dockerfile
# Install modelops-bundle (needed for bundle management)
RUN pip install --no-cache-dir git+https://${GITHUB_TOKEN}@github.com/institutefordiseasemodeling/modelops-bundle.git

# Install modelops-calabaria for calibration support
RUN pip install --no-cache-dir git+https://${GITHUB_TOKEN}@github.com/institutefordiseasemodeling/modelops-calabaria.git
```

These lines pull the latest code from the main branch of each repository at build time.
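Because main is a moving target, two builds on different days can bake in different dependency code. One option, shown here only as an illustration and not currently in the Dockerfile, is to pin each install to a commit SHA:

```dockerfile
# Illustrative variant: pin to a specific commit instead of the moving main branch
RUN pip install --no-cache-dir \
    git+https://${GITHUB_TOKEN}@github.com/institutefordiseasemodeling/modelops-calabaria.git@<commit-sha>
```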
Deployment Workflow for Fixes:

When you make a fix to calabaria (or any other installed dependency):

1. Commit and push to the calabaria repo:

   ```bash
   cd modelops-calabaria
   git add src/modelops_calabaria/calibration/wire.py
   git commit -m "fix: handle dict results in convert_to_trial_result"
   git push origin main
   ```

2. Trigger an image rebuild. The fix won't be in running pods until images are rebuilt with the updated dependency.

   Option A: Automatic rebuild (CI/CD)
   - Push any commit to the `modelops` repo (even a trivial change)
   - GitHub Actions will trigger and rebuild all images
   - Images are pushed to GHCR with the new calabaria code included

   Option B: Manual rebuild

   ```bash
   cd modelops
   make build-runner  # Rebuilds runner image with latest calabaria from GitHub
   docker push ghcr.io/institutefordiseasemodeling/modelops-dask-runner:latest
   ```

3. Restart Kubernetes pods to pull new images:

   ```bash
   kubectl rollout restart deployment/dask-runner -n modelops-dask-dev
   kubectl rollout status deployment/dask-runner -n modelops-dask-dev
   ```
Why Image Rebuilds Are Required:
Unlike code changes to modelops itself (which are in the COPY layer), calabaria is installed via `pip install git+https://...`. This means:

- The calabaria code is baked into the image at build time
- Simply restarting pods won't pick up calabaria fixes
- You must rebuild the image to get the latest code from GitHub
- Kubernetes image pull policies may cache `:latest` tags aggressively (use digests for reliability)
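The difference can be seen in a simplified Dockerfile fragment. This is illustrative, not the literal contents of `docker/Dockerfile.runner`:

```dockerfile
# modelops itself: copied from the build context, so a local rebuild picks up edits
COPY src/ /app/src/
RUN pip install --no-cache-dir /app

# calabaria: fetched from GitHub at build time; only an image rebuild refreshes it
RUN pip install --no-cache-dir git+https://${GITHUB_TOKEN}@github.com/institutefordiseasemodeling/modelops-calabaria.git
```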
Quick Fix Verification:
After deploying, verify the fix is actually running:
```bash
# Check that the new code is present in the pod
kubectl exec deployment/dask-runner -n modelops-dask-dev -- \
  grep -A5 "isinstance(result, dict)" \
  /usr/local/lib/python3.12/site-packages/modelops_calabaria/calibration/wire.py

# Or check the installed package version/commit
kubectl exec deployment/dask-runner -n modelops-dask-dev -- \
  pip show modelops-calabaria
```

```bash
# Build all images
make build  # Builds scheduler, worker, runner

# Build specific image
make build-worker
make build-scheduler
make build-runner

# Push to registry (after building)
make push

# Pull latest from registry
make pull-latest

# Full deployment cycle
make build push deploy verify-deploy
```

Always verify deployments actually worked:
```bash
# Custom verification command
make verify-deploy

# Manual verification
kubectl get pods -n modelops-dask-dev
kubectl logs -n modelops-dask-dev -l app=dask-worker --tail=10

# Run smoke test
mops dev smoke-test
```

- Always verify deployments - don't trust that `make deploy` worked
- Use digests for production - tags are mutable and cached
- Check environment variables - many issues are missing env vars
- Force delete pods when in doubt - Kubernetes caching is aggressive
- Review the image config - single source of truth in `modelops-images.yaml`
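As a closing illustration of the digest-first practice above, a verification script could compare a pod's reported `imageID` against the digest captured at build time. This is a hypothetical helper, not the actual `make verify-deploy` implementation:

```python
def image_matches_digest(image_id: str, expected_digest: str) -> bool:
    """True when a pod's containerStatuses[].imageID is pinned to expected_digest.

    imageID values typically look like:
    ghcr.io/institutefordiseasemodeling/modelops-dask-worker@sha256:abc...
    """
    # Everything after the last "@" is the resolved digest
    return image_id.rpartition("@")[2] == expected_digest
```

Pairing this with `kubectl get pod ... -o jsonpath='{.status.containerStatuses[0].imageID}'` and the contents of `.build/worker.digest` gives a concrete "is my fix actually running?" check.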