MLServer fails to load models from OCI image mounts on KServe #2352

@Snomaan6846

Description

MLServer fails to load models when deployed on KServe using OCI model images (the model-car sidecar). The model files are present in /mnt/models, but the runtimes fail to load them.

Environment

  • MLServer Version: 1.7.0+
  • Platform: KServe on Kubernetes
  • Storage: OCI Image via model-car sidecar
  • Affected Runtimes: XGBoost, LightGBM, CatBoost, and potentially others

Steps to Reproduce

  1. Deploy an InferenceService with OCI image storage:

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: xgboost-model
spec:
  predictor:
    model:
      modelFormat:
        name: xgboost
      storageUri: oci://quay.io/my-registry/xgboost-model:latest
      runtime: mlserver
```

  2. Check the MLServer logs.
  3. The model fails to load despite the files being present.

Expected Behavior

The model should load successfully from /mnt/models, as mounted by the model-car sidecar.

Actual Behavior

Error 1: Path Resolution Failure

```
XGBoostError: filesystem error: cannot make canonical path: No such file or directory [/mnt/models/model.json]
```

Python can read the file, but C++ libraries fail:

```python
import os

path = "/mnt/models/model.json"
print(os.path.exists(path))    # True
print(os.path.realpath(path))  # /proc/123/root/... or bind mount path

# Plain Python file I/O works:
with open(path, "r") as f:
    f.read()  # ✅ Success

# Loading through XGBoost's C++ core fails:
import xgboost as xgb
xgb.Booster(model_file=path)  # ❌ Raises XGBoostError
```
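
A user-side workaround that sidesteps the native path canonicalization is to hand XGBoost something other than the mounted path. This is a hedged sketch, not a fix in MLServer itself: the temp-copy variant relies only on the standard library, and the in-memory variant assumes an XGBoost version whose `Booster` accepts a `bytearray` for `model_file`.

```python
import shutil
import tempfile

import xgboost as xgb

SOURCE = "/mnt/models/model.json"

# Variant 1: copy the model off the bind-mounted path onto a regular
# filesystem that the native loader can canonicalize.
with tempfile.NamedTemporaryFile(suffix=".json", delete=False) as tmp:
    shutil.copyfile(SOURCE, tmp.name)
booster = xgb.Booster(model_file=tmp.name)

# Variant 2: bypass filesystem access in native code entirely by
# loading from an in-memory buffer (recent XGBoost versions).
with open(SOURCE, "rb") as f:
    booster = xgb.Booster(model_file=bytearray(f.read()))
```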

Error 2: Race Condition at Startup

MLServer starts "successfully" but no models are loaded:

```
2026-01-23 11:58:14,866 [mlserver.rest] INFO - HTTP server running on http://0.0.0.0:8080
2026-01-23 11:58:14,889 [mlserver.metrics] INFO - Metrics server running on http://0.0.0.0:8082
2026-01-23 11:58:14,891 [mlserver.grpc] INFO - gRPC server running on http://0.0.0.0:8081
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8080 (Press CTRL+C to quit)
```

No model-loading logs appear. The pod reports ready, but the model is not available:

```console
$ curl http://localhost:8080/v2/models/my-model/ready
# Returns {"error":"Model my-model not found"}

$ kubectl exec <pod> -- ls /mnt/models
# Files are present: model-settings.json, model.bst
```

Issue: MLServer starts before the model-car sidecar has mounted the files, finds no models, and never retries (see the wait-loop sketch below).
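
One way to close this race would be to gate startup on the mount actually being populated. A minimal sketch of such a wait loop follows; the directory name comes from this issue, while the timeout value and the gating itself are assumptions, not current MLServer behavior.

```python
import os
import time

MODEL_DIR = "/mnt/models"  # mount populated by the model-car sidecar
TIMEOUT_S = 120            # hypothetical startup budget

# Block until the sidecar has written at least one file, or give up.
deadline = time.monotonic() + TIMEOUT_S
while time.monotonic() < deadline:
    if os.path.isdir(MODEL_DIR) and os.listdir(MODEL_DIR):
        break
    time.sleep(1)
else:
    raise TimeoutError(f"no model files appeared in {MODEL_DIR}")
```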

Root Cause

  1. Bind Mount Issue: The model-car sidecar exposes files via bind mounts or proc-based paths (/proc/<pid>/root/...) that C++/native libraries cannot canonicalize (see the sketch after this list)
  2. Race Condition: The model-car sidecar mounts the files after MLServer starts, so MLServer's initial model scan finds nothing
  3. Symlink Resolution: os.path.realpath() returns paths that are inaccessible to C++ code
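
The canonicalization failure can be approximated from Python (3.10+) with strict path resolution, which resolves every component the way C++ std::filesystem::canonical does. Whether it fails in a given pod depends on how the mount is set up, so treat this as a diagnostic sketch rather than a guaranteed reproduction:

```python
import os

path = "/mnt/models/model.json"

# A plain existence check only stats the final path; it succeeds.
print(os.path.exists(path))  # True

# Strict resolution follows every symlink along the way, mirroring
# the C++ canonicalization, and raises if any component (e.g. a
# /proc/<pid>/root link) cannot be resolved.
try:
    print(os.path.realpath(path, strict=True))
except OSError as exc:
    print(f"canonicalization failed: {exc}")
```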

Impact

  • ❌ Blocks deployments on KServe with OCI images
  • ❌ Impacts all runtimes using C++/native libraries (XGBoost, LightGBM, CatBoost)

Additional Context

  • Python file I/O works fine with these mounts
  • The issue is specific to native libraries that try to canonicalize paths

Labels

bug kserve deployment high-priority xgboost runtime
