Dedicated service for model serving and inference (alternative to using api-service for model endpoints).
This service provides:
- Optimized model serving
- Fast inference endpoints (sketched below)
- Model version management
- A/B testing capabilities
- Model monitoring
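A minimal sketch of what the serving app in serve.py might look like, assuming FastAPI (consistent with the uvicorn command below); the `PredictRequest` schema and placeholder response are assumptions, not the service's actual code:

```python
# Minimal FastAPI serving sketch; request schema and response shape
# are illustrative assumptions, not this repo's real implementation.
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI(title="mlops-model-deploy")


class PredictRequest(BaseModel):
    features: dict  # free-form feature map, matching the curl example below


@app.post("/predict")
def predict(req: PredictRequest):
    # The real service would run the loaded model over req.features here.
    return {"prediction": None, "model_version": "latest"}
```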
Run locally:

```bash
uvicorn serve:app --host 0.0.0.0 --port 8080
```

Example request:

```bash
curl -X POST http://localhost:8080/predict \
  -H "Content-Type: application/json" \
  -d '{"features": {...}}'
```

Environment variables (a loading sketch follows the list):

- `MODEL_NAME`: Name of the model to serve
- `MODEL_VERSION`: Model version (default: latest)
- `GCS_BUCKET_MODELS`: GCS bucket with models
- `BATCH_SIZE`: Batch size for inference
- `MAX_WORKERS`: Number of workers
Run with Docker:

```bash
docker run -p 8080:8080 mlops-model-deploy
```

Deployment is managed through the Pulumi configuration in infrastructure/. The service can also be deployed to Vertex AI for managed serving.
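A hedged sketch of the Vertex AI path using the google-cloud-aiplatform SDK; the project, region, image URI, machine type, and health route are placeholders, not values from this repo:

```python
# Sketch: upload the serving container to Vertex AI and deploy it to an
# endpoint. All identifiers below are placeholders.
from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")

model = aiplatform.Model.upload(
    display_name="mlops-model-deploy",
    serving_container_image_uri="gcr.io/my-project/mlops-model-deploy:latest",
    serving_container_predict_route="/predict",
    serving_container_health_route="/health",  # assumes a health route exists
    serving_container_ports=[8080],
)
endpoint = model.deploy(machine_type="n1-standard-4")
```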
Performance features:

- Model caching (see the sketch after this list)
- Batch processing
- GPU support (if available)
- Async inference for large batches
- Response streaming
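As referenced in the list, a minimal sketch of model caching; the GCS bucket layout, joblib serialization format, and `load_model` helper are assumptions, not the service's actual code:

```python
# Sketch of a per-(name, version) model cache, assuming artifacts live at
# gs://$GCS_BUCKET_MODELS/<name>/<version>/model.joblib (assumed layout).
import os
from functools import lru_cache

import joblib
from google.cloud import storage


@lru_cache(maxsize=4)  # keep a few recently used model versions in memory
def load_model(name: str, version: str = "latest"):
    """Download a model artifact from GCS once and cache the loaded object."""
    bucket = storage.Client().bucket(os.environ["GCS_BUCKET_MODELS"])
    local_path = f"/tmp/{name}-{version}.joblib"
    bucket.blob(f"{name}/{version}/model.joblib").download_to_filename(local_path)
    return joblib.load(local_path)
```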
Build:

```bash
docker build -t mlops-model-deploy .
```

Run:

```bash
docker run -p 8080:8080 mlops-model-deploy
```