Description
MLServer fails to load models when deployed on KServe using OCI model images (model-car sidecar). The model files are present in /mnt/models but runtimes fail to load them.
Environment
- MLServer Version: 1.7.0+
- Platform: KServe on Kubernetes
- Storage: OCI Image via model-car sidecar
- Affected Runtimes: XGBoost, LightGBM, CatBoost, and potentially others
Steps to Reproduce
- Deploy an InferenceService with OCI image storage:
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: xgboost-model
spec:
  predictor:
    model:
      modelFormat:
        name: xgboost
      storageUri: oci://quay.io/my-registry/xgboost-model:latest
      runtime: mlserver
- Check MLServer logs
- Model fails to load despite files being present
Expected Behavior
Model should load successfully from /mnt/models mounted by model-car sidecar.
Actual Behavior
Error 1: Path Resolution Failure
XGBoostError: filesystem error: cannot make canonical path: No such file or directory [/mnt/models/model.json]
Python can read the file, but XGBoost's C++ loader fails:
import os
path = "/mnt/models/model.json"
print(os.path.exists(path))    # True
print(os.path.realpath(path))  # /proc/123/root/... or bind mount path
# Python works:
with open(path, 'r') as f:
    f.read()  # ✅ Success
# XGBoost C++ fails:
import xgboost as xgb
xgb.Booster(model_file=path)  # ❌ Fails
Error 2: Race Condition at Startup
MLServer starts "successfully" but no models are loaded:
2026-01-23 11:58:14,866 [mlserver.rest] INFO - HTTP server running on http://0.0.0.0:8080
2026-01-23 11:58:14,889 [mlserver.metrics] INFO - Metrics server running on http://0.0.0.0:8082
2026-01-23 11:58:14,891 [mlserver.grpc] INFO - gRPC server running on http://0.0.0.0:8081
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8080 (Press CTRL+C to quit)
No model loading logs appear. Pod shows ready, but model is not available:
$ curl http://localhost:8080/v2/models/my-model/ready
# Returns {"error":"Model my-model not found"}
$ kubectl exec <pod> -- ls /mnt/models
# Files are present: model-settings.json, model.bst
Issue: MLServer starts before the model-car sidecar finishes mounting the files. It finds no models and does not retry.
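Until MLServer retries on an empty model repository, a startup wrapper can poll the mount before launching the server. A minimal stdlib-only sketch (the marker file, timeout, and poll interval are illustrative, not MLServer API):

```python
import os
import time


def wait_for_models(model_dir: str, timeout: float = 120.0, poll: float = 1.0) -> bool:
    """Poll model_dir until model-settings.json appears or timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            present = "model-settings.json" in os.listdir(model_dir)
        except FileNotFoundError:
            present = False  # mount point not created yet
        if present:
            return True
        time.sleep(poll)
    return False
```

A wrapper entrypoint could call this and only exec `mlserver start /mnt/models` once it returns True, exiting non-zero on timeout so Kubernetes restarts the pod.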
Root Cause
- Bind Mount Issue: the model-car sidecar uses bind mounts or proc-based paths (/proc/<pid>/root/...) that C++/native libraries cannot canonicalize
- Race Condition: model-car mounts the files after MLServer starts, so MLServer scans an empty model repository and never retries
- Symlink Resolution: os.path.realpath() returns paths that are inaccessible to C++ code
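A possible workaround for the canonicalization failure is to copy the mounted files to an ordinary local directory before handing a path to the native loader. A sketch using only the standard library (the helper name and temp prefix are illustrative):

```python
import shutil
import tempfile


def localize_model_dir(mount_dir: str) -> str:
    """Copy a bind-mounted model directory to a plain local path.

    Native loaders can canonicalize the copy even when the original
    /proc-based or bind-mount path cannot be resolved.
    """
    local_dir = tempfile.mkdtemp(prefix="mlserver-model-")
    shutil.copytree(mount_dir, local_dir, dirs_exist_ok=True)
    return local_dir
```

The trade-off is doubled disk usage for the model files, which may matter for large models.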
Impact
- ❌ Blocks deployments on KServe with OCI images
- ❌ Impacts all runtimes using C++/native libraries (XGBoost, LightGBM, CatBoost)
Additional Context
- Python file I/O works fine with these mounts
- Issue is specific to native libraries that try to canonicalize paths
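Because Python file I/O works on these mounts, another way to sidestep native path resolution is to read the file in Python and hand XGBoost an in-memory buffer (Booster.load_model accepts a bytearray). A sketch (function names are illustrative):

```python
def read_model_bytes(path: str) -> bytearray:
    """Read the model file via Python I/O, which works on these mounts."""
    with open(path, "rb") as f:
        return bytearray(f.read())


def load_booster(path: str):
    """Load an XGBoost Booster from an in-memory buffer, so the C++
    loader never has to canonicalize the problematic /mnt/models path."""
    import xgboost as xgb  # deferred import: the helper above is stdlib-only

    booster = xgb.Booster()
    booster.load_model(read_model_bytes(path))
    return booster
```

A custom MLServer runtime could use this pattern, though it does not help runtimes that pass the path straight to the native library.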
Labels
bug kserve deployment high-priority xgboost runtime