Tracking: upstream issues to file against ogx-ai/* (drafts inside) #30

@rdwj

Description

Purpose

Three real upstream issues surfaced during the install-ogx and observability-backends validation runs. They need to be filed against the appropriate upstream repo (operator vs server) but only after we research each repo's contribution guidelines, issue templates, and triage process. This issue holds the drafted bodies so they can be reviewed and filed in a separate session.

The drafts below assume an OpenShift 4.20 cluster with RHOAI 3.x and KServe configured for RawDeployment. Versions referenced: ogx-k8s-operator v0.9.0, distribution-starter:0.7.1.


Draft 1 — operator advertises an unbuilt distribution image

Target repo: ogx-ai/ogx-k8s-operator

Title: `status.distributionConfig.availableDistributions` advertises images that were never published

Body:

The operator's `status.distributionConfig.availableDistributions` map advertises a `remote-vllm` distribution at `docker.io/llamastack/distribution-remote-vllm:0.7.1`, but that tag was never published. Patching a `LlamaStackDistribution` to use that name results in `ImagePullBackOff` with `manifest unknown`.

Repro

  1. Install the operator at v0.9.0 from `https://raw.githubusercontent.com/ogx-ai/ogx-k8s-operator/release/operator.yaml`.

  2. Inspect a freshly-created `LlamaStackDistribution`:

    ```bash
    oc get llamastackdistribution -o jsonpath='{.status.distributionConfig.availableDistributions}'
    ```

    Output includes:

    ```
    "remote-vllm": "docker.io/llamastack/distribution-remote-vllm:0.7.1"
    ```

  3. Patch the CR to use that distribution:

    ```bash
    oc patch llamastackdistribution --type=merge \
    -p '{"spec":{"server":{"distribution":{"name":"remote-vllm"}}}}'
    ```

  4. Observe pod state:

    ```
    Failed to pull image "docker.io/llamastack/distribution-remote-vllm:0.7.1":
    manifest for docker.io/llamastack/distribution-remote-vllm:0.7.1 not found:
    manifest unknown
    ```

    Docker Hub's tag list for `llamastack/distribution-remote-vllm` tops out at `0.2.12` — the `0.7.1` tag does not exist.

Expected

Either:

  • The advertised distributions are validated (operator pulls or HEADs each before adding it to `availableDistributions`), or
  • The advertised list is restricted to images the operator has independent reason to believe exist (e.g., a CI-published manifest baked in at release time)

Actual

Users take the operator-advertised list at face value, and the failure only surfaces as `ImagePullBackOff` after the CR has been applied.

Suggested fix

Validate or filter `availableDistributions` at controller startup, or document clearly which entries are guaranteed-pullable vs. aspirational.
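The first option could be as simple as filtering the advertised map through a registry existence check before publishing it to `.status`. A minimal Python sketch of the filtering step, with the registry probe injected as a callable so the logic is testable offline (all names here are hypothetical; the operator would do this in its own codebase with a real registry client):

```python
def filter_available(distros, image_exists):
    """Keep only distributions whose image reference can actually be resolved.

    `distros` maps distribution name -> image reference; `image_exists` is an
    injected callable (e.g. a registry HEAD request) so the filter itself
    needs no network access. Hypothetical names, not the operator's API.
    """
    return {name: ref for name, ref in distros.items() if image_exists(ref)}


advertised = {
    "starter": "docker.io/llamastack/distribution-starter:0.7.1",
    "remote-vllm": "docker.io/llamastack/distribution-remote-vllm:0.7.1",
}
# Stand-in for a real registry check; remote-vllm 0.7.1 was never published.
published = {"docker.io/llamastack/distribution-starter:0.7.1"}
print(filter_available(advertised, published.__contains__))
# → {'starter': 'docker.io/llamastack/distribution-starter:0.7.1'}
```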


Draft 2 — operator skips child-Deployment reconciliation on CR spec change

Target repo: `ogx-ai/ogx-k8s-operator`

Title: Controller does not reconcile child `Deployment` after `.spec.server.distribution` change

Body:

When patching `LlamaStackDistribution.spec.server.distribution.name` from one value to another, the LlamaStackDistribution (LSD) `.spec` updates correctly, but the downstream `Deployment`'s container image does not. The operator log records `"LlamaStackDistribution CR spec changed"`, yet no follow-up Deployment reconciliation occurs. Manually deleting the Deployment also does not trigger recreation.

Repro

  1. Apply an LSD with `spec.server.distribution.name: remote-vllm` (broken — see related issue) and observe ImagePullBackOff.

  2. Patch back to a working distribution:

    ```bash
    oc patch llamastackdistribution --type=merge \
    -p '{"spec":{"server":{"distribution":{"name":"starter"}}}}'
    ```

  3. Observe `oc get deployment -o jsonpath='{.spec.template.spec.containers[0].image}'` — still pinned to the broken `distribution-remote-vllm:0.7.1`.

  4. Delete the Deployment manually: `oc delete deployment <name>` — operator does not recreate it.

  5. Workaround: force a reconcile via a no-op annotation:

    ```bash
    oc annotate llamastackdistribution reconcile-tap=$(date +%s) --overwrite
    ```

    Deployment is recreated with the correct image within seconds.

Expected

Any change to `.spec` that affects the rendered Deployment template should trigger a reconciliation that updates (or, if missing, recreates) the child Deployment.

Actual

Some `.spec` transitions are not propagated to the child Deployment, and a deleted child Deployment is not recreated until the LSD is re-annotated.

Environment

  • ogx-k8s-operator v0.9.0
  • OpenShift 4.20
  • LSD with `spec.server.distribution.name: starter` initially

Suggested fix

Ensure the controller re-renders and reapplies the child Deployment on every `.spec` change, and that the controller's owned-resource watch correctly recreates a deleted child Deployment on the next reconcile loop.
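One common controller-side pattern for the propagation half of this is to stamp the child Deployment with a hash of its rendered template and reapply whenever the stored hash drifts. A minimal Python sketch of that comparison, using a hypothetical annotation key (this illustrates the general pattern, not the operator's actual code):

```python
import hashlib
import json


def spec_hash(rendered):
    """Stable short hash of a rendered Deployment template (hypothetical helper)."""
    canonical = json.dumps(rendered, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()[:16]


def needs_update(current_annotations, rendered):
    """True when the stored hash differs from the freshly rendered template's."""
    return current_annotations.get("ogx.ai/spec-hash") != spec_hash(rendered)


old = {"image": "docker.io/llamastack/distribution-remote-vllm:0.7.1"}
new = {"image": "docker.io/llamastack/distribution-starter:0.7.1"}
annotations = {"ogx.ai/spec-hash": spec_hash(old)}
print(needs_update(annotations, old))  # False: nothing changed, skip the apply
print(needs_update(annotations, new))  # True: image changed, reapply Deployment
```

The recreate-on-delete half is separate: it depends on the controller's owned-resource watch enqueuing a reconcile when a child Deployment disappears.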


Draft 3 — `setup_telemetry()` initializes `MeterProvider` but not `TracerProvider`

Target repo: `meta-llama/llama-stack` (server-side; could alternatively be addressed in `ogx-ai/ogx-k8s-operator` by adjusting the image entrypoint)

Title: `setup_telemetry()` initializes `MeterProvider` only — traces are silently discarded

Body:

`llama_stack/telemetry/init.py`'s `setup_telemetry()` initializes a `MeterProvider` from `OTEL_*` environment variables but does not initialize a `TracerProvider`. Routers (`core/routers/safety.py`, `core/routers/inference.py`, etc.) call `from opentelemetry import trace` and create spans, but with no `TracerProvider` set, those calls return the no-op default tracer and the spans are discarded.

Symptom

Setting `OTEL_EXPORTER_OTLP_ENDPOINT` (and the related `OTEL_*` env vars) results in metrics being exported but traces never reaching the configured collector. Jaeger ingests traces (not metrics), so its UI stays empty even though the env vars are wired correctly.
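The no-op behavior is easy to confirm from a Python shell inside the server container. A small probe, assuming only that `opentelemetry-api` is importable (which the starter image already provides); the import is deferred so the snippet also loads where the package is absent:

```python
def span_is_recording():
    """Return whether a freshly started span would actually be recorded.

    With no TracerProvider configured, trace.get_tracer() hands back the
    no-op default tracer, and its spans report is_recording() == False.
    """
    from opentelemetry import trace  # deferred: needs opentelemetry-api

    tracer = trace.get_tracer("otel-probe")
    with tracer.start_as_current_span("probe") as span:
        return span.is_recording()

# Inside the starter container, span_is_recording() returns False until a
# TracerProvider is installed, confirming spans are silently dropped.
```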

Workaround

Override the container entrypoint to wrap the server with `opentelemetry-instrument`, which auto-configures both `MeterProvider` and `TracerProvider` from the standard `OTEL_*` env vars. The starter image already ships `opentelemetry-distro` and the relevant instrumentation packages, so no image rebuild is needed:

```yaml
spec:
  server:
    containerSpec:
      command: ["opentelemetry-instrument"]
      args:
        - uvicorn
        - llama_stack.core.server.server:create_app
        - --host
        - "0.0.0.0"
        - --port
        - "8321"
        - --workers
        - "1"
        - --factory
```

With this in place, Jaeger's `/api/services` returns the configured `OTEL_SERVICE_NAME` and traces include spans for `POST /v1/chat/completions`, `POST /v1/safety/run-shield`, internal `chat ` spans, httpx `connect` spans for outbound vLLM calls, sqlite3 spans, and asgi `http send/receive` spans.

Expected

`setup_telemetry()` configures both `MeterProvider` and `TracerProvider` so that the `from opentelemetry import trace` call sites already present in the routers actually emit spans without requiring an entrypoint override.

Suggested fix

Mirror the `MeterProvider` setup for `TracerProvider` inside `setup_telemetry()`. The OpenTelemetry SDK pieces are already installed in the starter image, so this is a few lines of init.
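A sketch of what that mirrored setup might look like, using the standard OpenTelemetry SDK pieces (the function name and defaults are hypothetical; the real change would live next to the existing `MeterProvider` code in `setup_telemetry()`, and the imports are deferred here so the snippet loads without the SDK installed):

```python
import os


def setup_tracing(endpoint=None, service_name=None):
    """Configure a global TracerProvider from OTEL_* env vars (sketch only)."""
    from opentelemetry import trace
    from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
    from opentelemetry.sdk.resources import Resource
    from opentelemetry.sdk.trace import TracerProvider
    from opentelemetry.sdk.trace.export import BatchSpanProcessor

    endpoint = endpoint or os.environ["OTEL_EXPORTER_OTLP_ENDPOINT"]
    service_name = service_name or os.environ.get("OTEL_SERVICE_NAME", "llama-stack")

    provider = TracerProvider(resource=Resource.create({"service.name": service_name}))
    provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter(endpoint=endpoint)))
    trace.set_tracer_provider(provider)
    return provider
```

With a `TracerProvider` registered, the routers' existing `trace.get_tracer(...)` call sites resolve to a real tracer instead of the no-op default.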

Environment

  • ogx-k8s-operator v0.9.0
  • distribution-starter:0.7.1
  • OpenTelemetry collector reachable at the configured `OTEL_EXPORTER_OTLP_ENDPOINT`

Filing checklist

  • Read each upstream repo's CONTRIBUTING.md and issue template
  • Confirm the right repo for Draft 3 (server vs operator-side fix)
  • File Draft 1 — operator
  • File Draft 2 — operator
  • File Draft 3 — server (or operator, depending on triage)
  • Cross-link from this issue and close it once all three are filed
