Inference Operator v3.2.0 ignores l2CacheBackend: redis, hardcodes LMCACHE_REMOTE_URL=sagemaker-hyperpod://... #431

@schroderfernando16

Environment

  • EKS add-on amazon-sagemaker-hyperpod-inference v1.3.0-eksbuild.1 (latest available in ap-northeast-1)
  • Inference Operator image hyperpod-inference-operator:v3.2.0
  • Worker image lmcache/vllm-openai:v0.4.7 (LMCache 0.4.7, vLLM 0.23.0)
  • Instance type ml.g7e.4xlarge, HyperPod EKS-orchestrated cluster

What happened

Setting kvCacheSpec.l2CacheSpec.l2CacheBackend: redis (with l2CacheLocalUrl: redis://...:6379) in the InferenceEndpointConfig has no effect. Instead, the operator injects the following environment variables into the worker pod:

LMCACHE_REMOTE_URL=sagemaker-hyperpod://$(NODE_IP):9200
LMCACHE_EXTRA_CONFIG={"sagemaker_hyperpod_shared_memory_name": "ai_toolkit_cache"}

The redis value from the CRD is silently dropped: the operator never emits LMCACHE_REMOTE_URL=redis://.... Attempting to override LMCACHE_REMOTE_URL via worker.environmentVariables does not work either. Those LMCACHE_* entries do not appear in the rendered pod, so the operator appears to strip or override them.

The worker is therefore forced onto the sagemaker-hyperpod connector, which requires a host-side ai-toolkit daemon (POSIX shared memory /ai_toolkit_cache plus TCP :9200). That daemon is not present on the cluster, so LMCache enters degraded mode and the L2 cache never stores anything:

LMCache ERROR: Failed to initialize shared memory: [Errno 22] Invalid argument: '/ai_toolkit_cache'
LMCache WARNING: Health check failed: RemoteBackendHealthCheck(sagemaker-hyperpod://<NODE_IP>:9200)
LMCache WARNING: HealthMonitor: System unhealthy, entering degraded mode
LMCache WARNING: LMCache is unhealthy, skipping store operation
... LMCache hit tokens: 0     (External prefix cache hit rate: 0.0%)
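To make the degraded state easier to spot across many workers, the log check can be scripted. This is only a sketch: the function name lmcache_log_status is made up for illustration, and the match strings are copied from the log lines quoted above, so they may not match other LMCache versions.

```shell
# Reads worker log lines on stdin and prints "degraded" if any of the
# LMCache messages quoted in this issue appear, otherwise "ok".
# Assumption: the message texts are exactly those shown above.
lmcache_log_status() {
  if grep -q \
      -e 'Failed to initialize shared memory' \
      -e 'entering degraded mode' \
      -e 'LMCache is unhealthy, skipping store operation'; then
    echo degraded
  else
    echo ok
  fi
}

# Usage (the pod name is a placeholder):
#   kubectl logs <worker> | lmcache_log_status
```

The function only reads stdin, so it can also be run against saved log files.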

Expected behavior

With l2CacheBackend: redis + l2CacheLocalUrl: redis://...:6379, the worker's LMCache should be configured with LMCACHE_REMOTE_URL=redis://..., connect to the specified Redis endpoint, report healthy, and perform L2 KV-cache store/lookup. redis is listed as a supported L2 backend in the docs (KV cache & intelligent routing), so the operator overriding it contradicts the documentation.

Reproduction

  1. Deploy an InferenceEndpointConfig with:
    kvCacheSpec:
      enableL1Cache: true
      enableL2Cache: true
      l2CacheSpec:
        l2CacheBackend: redis
        l2CacheLocalUrl: redis://<redis-svc>.<ns>.svc.cluster.local:6379
  2. (Optionally) add worker.environmentVariables entries for LMCACHE_REMOTE_URL to try to override the injected value.
  3. Inspect the rendered worker pod:
    kubectl get pod <worker> -o jsonpath='{range .spec.containers[*].env[*]}{.name}={.value}{"\n"}{end}' | grep LMCACHE
    
  4. Observe LMCACHE_REMOTE_URL=sagemaker-hyperpod://$(NODE_IP):9200 (not redis://...), and the worker logs showing the LMCache unhealthy/degraded loop above.
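Steps 3 and 4 can be turned into a pass/fail check, which may be handy for confirming a future operator fix. The sketch below assumes the NAME=VALUE output format produced by the jsonpath command in step 3. The function names lmcache_remote_scheme and check_l2_backend are hypothetical helpers, not part of kubectl or the operator.

```shell
# Reads NAME=VALUE lines on stdin and prints the URL scheme of the first
# LMCACHE_REMOTE_URL entry (for example "redis" or "sagemaker-hyperpod").
lmcache_remote_scheme() {
  sed -n 's|^LMCACHE_REMOTE_URL=\([A-Za-z][A-Za-z0-9+.-]*\)://.*|\1|p' | head -n 1
}

# Compares the scheme found on stdin with the expected one given as $1.
# Returns 0 on a match and 1 otherwise.
check_l2_backend() {
  expected="$1"
  actual="$(lmcache_remote_scheme)"
  if [ "$actual" = "$expected" ]; then
    echo "OK: LMCACHE_REMOTE_URL uses scheme '$actual'"
  else
    echo "MISMATCH: expected scheme '$expected', found '${actual:-none}'"
    return 1
  fi
}

# Usage (the pod name is a placeholder):
#   kubectl get pod <worker> \
#     -o jsonpath='{range .spec.containers[*].env[*]}{.name}={.value}{"\n"}{end}' \
#     | check_l2_backend redis
```

On the cluster described here the check would report a mismatch, since the rendered pod carries the sagemaker-hyperpod scheme.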

Additional question (tieredstorage)

Separately: l2CacheBackend: tieredstorage requires the host-side ai-toolkit daemon (shm /ai_toolkit_cache + :9200). On our cluster this daemon is not installed as any DaemonSet, and the add-on configuration schema (aws eks describe-addon-configuration) exposes no toggle for it (only alb, enableCustomServiceAccounts, executionRoleArn, hyperpodClusterArn, jumpstartGatedModelDownloadRoleArn, keda, tlsCertificateS3Bucket). How is tiered storage meant to be enabled? Is it a cluster-creation flag, and can it be enabled on an existing cluster?
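For anyone who wants to confirm the missing daemon on their own cluster, a rough way to search DaemonSets by name is sketched below. It assumes the daemon's DaemonSet name would contain "ai-toolkit", which is a guess based on the shared memory name and not something the documentation confirms. find_daemonset is a hypothetical helper.

```shell
# Reads "NAMESPACE NAME" lines on stdin and prints the rows whose DaemonSet
# name contains the substring given as $1 (case-insensitive).
# Returns 1 when nothing matches.
find_daemonset() {
  awk -v pat="$1" '
    BEGIN { p = tolower(pat) }
    index(tolower($2), p) { print; found = 1 }
    END { exit (found ? 0 : 1) }
  '
}

# Usage (the "ai-toolkit" substring is an assumption, see above):
#   kubectl get daemonsets -A --no-headers \
#     -o custom-columns=NS:.metadata.namespace,NAME:.metadata.name \
#     | find_daemonset ai-toolkit
```

A non-zero exit status here only shows that no DaemonSet with a matching name exists. It does not rule out the daemon being delivered some other way, for example baked into the node image.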
