Purpose
Three real upstream issues surfaced during the install-ogx and observability-backends validation runs. They need to be filed against the appropriate upstream repo (operator vs server) but only after we research each repo's contribution guidelines, issue templates, and triage process. This issue holds the drafted bodies so they can be reviewed and filed in a separate session.
The drafts below assume an OpenShift 4.20 cluster with RHOAI 3.x and KServe configured for RawDeployment. Versions referenced: ogx-k8s-operator v0.9.0, distribution-starter:0.7.1.
Draft 1 — operator advertises an unbuilt distribution image
Target repo: ogx-ai/ogx-k8s-operator
Title: `status.distributionConfig.availableDistributions` advertises images that were never published
Body:
The operator's `status.distributionConfig.availableDistributions` map advertises a `remote-vllm` distribution at `docker.io/llamastack/distribution-remote-vllm:0.7.1`, but that tag was never published. Patching a `LlamaStackDistribution` to use that name results in `ImagePullBackOff` with `manifest unknown`.
Repro
- Install operator at v0.9.0 from `https://raw.githubusercontent.com/ogx-ai/ogx-k8s-operator/release/operator.yaml`.
- Inspect a freshly-created `LlamaStackDistribution`:
```bash
oc get llamastackdistribution -o jsonpath='{.status.distributionConfig.availableDistributions}'
```
Output includes:
```
"remote-vllm": "docker.io/llamastack/distribution-remote-vllm:0.7.1"
```
- Patch the CR to use that distribution:
```bash
oc patch llamastackdistribution --type=merge \
-p '{"spec":{"server":{"distribution":{"name":"remote-vllm"}}}}'
```
- Observe pod state:
```
Failed to pull image "docker.io/llamastack/distribution-remote-vllm:0.7.1":
manifest for docker.io/llamastack/distribution-remote-vllm:0.7.1 not found:
manifest unknown
```
Docker Hub's tag list for `llamastack/distribution-remote-vllm` tops out at `0.2.12` — the `0.7.1` tag does not exist.
Expected
Either:
- The advertised distributions are validated (operator pulls or HEADs each before adding it to `availableDistributions`), or
- The advertised list is restricted to images the operator has independent reason to believe exist (e.g., a CI-published manifest baked in at release time)
Actual
Users take the operator-advertised list at face value, and the problem surfaces as `ImagePullBackOff` only after the CR is applied.
Suggested fix
Validate or filter `availableDistributions` at controller startup, or document clearly which entries are guaranteed-pullable vs. aspirational.
Draft 2 — operator skips child-Deployment reconciliation on CR spec change
Target repo: `ogx-ai/ogx-k8s-operator`
Title: Controller does not reconcile child `Deployment` after `.spec.server.distribution` change
Body:
When patching `LlamaStackDistribution.spec.server.distribution.name` from one value to another, the LSD's `.spec` updates correctly but the downstream `Deployment`'s container image is not updated. The operator log records `"LlamaStackDistribution CR spec changed"` but no follow-up Deployment reconciliation occurs. Manually deleting the Deployment also does not trigger recreation.
Repro
- Apply an LSD with `spec.server.distribution.name: remote-vllm` (broken; see Draft 1) and observe `ImagePullBackOff`.
- Patch back to a working distribution:
```bash
oc patch llamastackdistribution --type=merge \
-p '{"spec":{"server":{"distribution":{"name":"starter"}}}}'
```
- Observe `oc get deployment -o jsonpath='{.spec.template.spec.containers[0].image}'`: still pinned to the broken `distribution-remote-vllm:0.7.1`.
- Delete the child Deployment manually (`oc delete deployment `); the operator does not recreate it.
- Workaround: force a reconcile via a no-op annotation:
```bash
oc annotate llamastackdistribution reconcile-tap=$(date +%s) --overwrite
```
Deployment is recreated with the correct image within seconds.
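For scripting the workaround, the same tap can be expressed as a JSON merge-patch body (equivalent to the `oc annotate` call; the annotation key is arbitrary and carries no meaning to the operator):

```python
import time


def reconcile_tap_patch() -> dict:
    # Merge-patch body that bumps a no-op annotation to force a reconcile,
    # mirroring: oc annotate llamastackdistribution reconcile-tap=$(date +%s) --overwrite
    return {"metadata": {"annotations": {"reconcile-tap": str(int(time.time()))}}}
```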
Expected
Any change to `.spec` that affects the rendered Deployment template should trigger a reconciliation that updates (or, if missing, recreates) the child Deployment.
Actual
Some `.spec` transitions are not propagated to the child Deployment, and a deleted child Deployment is not recreated until the LSD is re-annotated.
Environment
- ogx-k8s-operator v0.9.0
- OpenShift 4.20
- LSD with `spec.server.distribution.name: starter` initially
Suggested fix
Ensure the controller re-renders and reapplies the child Deployment on every `.spec` change, and that the controller's owned-resource watch correctly recreates a deleted child Deployment on the next reconcile loop.
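Independent of the root cause, a common guard is to stamp a deterministic hash of the rendered spec onto the child Deployment's pod template, so that any `.spec` change necessarily produces a template diff on the next reconcile. A sketch of the hashing step (names are illustrative, not the operator's actual code):

```python
import hashlib
import json


def spec_hash(spec: dict) -> str:
    # Canonical JSON (sorted keys, no whitespace) gives a stable hash
    # for equal specs and a different hash for any field change.
    canonical = json.dumps(spec, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()[:16]


# The controller would set this on the rendered Deployment, e.g.:
#   template.metadata.annotations["lsd-spec-hash"] = spec_hash(cr_spec)
# so changing spec.server.distribution.name always forces an update.
```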
Draft 3 — `setup_telemetry()` initializes `MeterProvider` but not `TracerProvider`
Target repo: `meta-llama/llama-stack` (server-side; could alternatively be addressed in `ogx-ai/ogx-k8s-operator` by adjusting the image entrypoint)
Title: `setup_telemetry()` initializes `MeterProvider` only — traces are silently discarded
Body:
`llama_stack/telemetry/init.py`'s `setup_telemetry()` initializes a `MeterProvider` from `OTEL_*` environment variables but does not initialize a `TracerProvider`. Routers (`core/routers/safety.py`, `core/routers/inference.py`, etc.) call `from opentelemetry import trace` and create spans, but with no `TracerProvider` set, those calls return the no-op default tracer and the spans are discarded.
Symptom
Setting `OTEL_EXPORTER_OTLP_ENDPOINT` (and related env) results in metrics being exported but traces never reaching the configured collector. Jaeger ingests traces (not metrics), so its UI stays empty even though the env vars are wired correctly.
Workaround
Override the container entrypoint to wrap the server with `opentelemetry-instrument`, which auto-configures both `MeterProvider` and `TracerProvider` from the standard `OTEL_*` env vars. The starter image already ships `opentelemetry-distro` and the relevant instrumentation packages, so no image rebuild is needed:
```yaml
spec:
server:
containerSpec:
command: ["opentelemetry-instrument"]
args:
- uvicorn
- llama_stack.core.server.server:create_app
- --host
- "0.0.0.0"
- --port
- "8321"
- --workers
- "1"
- --factory
```
With this in place, Jaeger's `/api/services` returns the configured `OTEL_SERVICE_NAME` and traces include spans for `POST /v1/chat/completions`, `POST /v1/safety/run-shield`, internal `chat ` spans, httpx `connect` spans for outbound vLLM calls, sqlite3 spans, and asgi `http send/receive` spans.
Expected
`setup_telemetry()` configures both `MeterProvider` and `TracerProvider` so that the `from opentelemetry import trace` call sites already present in the routers actually emit spans without requiring an entrypoint override.
Suggested fix
Mirror the `MeterProvider` setup for `TracerProvider` inside `setup_telemetry()`. The OpenTelemetry SDK pieces are already installed in the starter image, so this is a few lines of init.
Environment
- ogx-k8s-operator v0.9.0
- distribution-starter:0.7.1
- OpenTelemetry collector reachable at the configured `OTEL_EXPORTER_OTLP_ENDPOINT`
Filing checklist
- [ ] Review each target repo's contribution guidelines and issue templates
- [ ] Confirm each repo's triage process (labels, required fields) before filing
- [ ] File Drafts 1 and 2 against `ogx-ai/ogx-k8s-operator`
- [ ] File Draft 3 against `meta-llama/llama-stack` (or redirect to the operator if the entrypoint fix is preferred)