If you want, you can back up and restore TrustyAI scheduled metrics during operator upgrades (see Backup and restore metrics at the end).
Prereqs: `oc` (and `oc login`). `jq` is optional; it just makes JSON output easier to read.
This fix is relevant if you have GuardrailsOrchestrator instances and:
- your GuardrailsOrchestrator `/health` or `/info` endpoints return errors, or
- you enabled OpenTelemetry on RHOAI 2.25 and now the `spec.otelExporter` fields are not compatible with RHOAI 3.x.
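To check whether the endpoints are currently failing, you can query the orchestrator health route directly. This assumes the route is named `guardrails-orchestrator-health`, as in the verification step further down; adjust if yours differs:

```bash
# Look up the orchestrator health route and check the HTTP status of /health and /info.
ORCH_ROUTE_HEALTH=$(oc get routes -n <namespace> guardrails-orchestrator-health -o jsonpath='{.spec.host}')
curl -s -o /dev/null -w '/health -> %{http_code}\n' "https://$ORCH_ROUTE_HEALTH/health"
curl -s -o /dev/null -w '/info   -> %{http_code}\n' "https://$ORCH_ROUTE_HEALTH/info"
```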
List the GuardrailsOrchestrator instances in your namespace:

`oc get guardrailsorchestrator -n <namespace>`

If you don't know the namespace, list them across all namespaces:
`oc get guardrailsorchestrator -A`

If this returns `No resources found`, you can skip Fix 1.
Fix 1 patches the GuardrailsOrchestrator Deployment(s) to add the expected readiness probe on port 8034 at `/health`.
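You can confirm the probe before and after patching by inspecting the Deployment directly. A minimal sketch, assuming the orchestrator container is the first container in the pod spec and `<orchestrator-deployment>` is your Deployment name:

```bash
# Print the readiness probe of the first container; after the patch it should
# target port 8034 and path /health.
oc get deployment <orchestrator-deployment> -n <namespace> \
  -o jsonpath='{.spec.template.spec.containers[0].readinessProbe}{"\n"}'
```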
Run the script:

`./patch-guardrails-deployment.sh <namespace>`

Verify:
`ORCH_ROUTE_HEALTH=$(oc get routes -n <namespace> guardrails-orchestrator-health -o jsonpath='{.spec.host}')`

`curl -s https://$ORCH_ROUTE_HEALTH/info | jq`

If the route name differs, list routes in the namespace and pick the GuardrailsOrchestrator health route:
`oc get routes -n <namespace>`

If you enabled OpenTelemetry on RHOAI 2.25, the `spec.otelExporter` keys changed in RHOAI 3.x. The `migrate-gorch-otel-exporter.sh` script migrates the old fields to the new ones.
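To see which instances actually set `spec.otelExporter` (only those need migrating), a quick check like the following may help; the `jq` filter is illustrative:

```bash
# Print the names of GuardrailsOrchestrators that have spec.otelExporter set.
oc get guardrailsorchestrator -n <namespace> -o json \
  | jq -r '.items[] | select(.spec.otelExporter != null) | .metadata.name'
```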
- Check what would be migrated:
  `./migrate-gorch-otel-exporter.sh --namespace <namespace>`
  `./migrate-gorch-otel-exporter.sh --namespace <namespace> --dry-run`
- Apply the migration:
  `./migrate-gorch-otel-exporter.sh --namespace <namespace> --fix`
- Verify the migrated fields (you should see keys like `otlpProtocol`, `otlpTracesEndpoint`, `otlpMetricsEndpoint`, `enableTraces`, `enableMetrics`):
  `oc get guardrailsorchestrator -n <namespace> <name> -o jsonpath='{.spec.otelExporter}{"\n"}'`

A GPU scheduling deadlock can occur when an LLM deployment already exists and a TrustyAI service is then created in the same namespace. The LLM deployment then gets stuck in Pending.
- Look for an InferenceService predictor that has both a Running pod and a Pending pod:
  `oc get pods -n <namespace> -l component=predictor`
  Typical symptoms:
- Two pods for the same predictor
- One pod Running, one pod Pending
- Different container counts (e.g. `2/2` vs `0/3`)
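If you prefer to check by hand before running the helper script, filtering on pod phase narrows it down; a sketch using the same label selector as above:

```bash
# Show only Pending predictor pods; a deadlocked predictor appears here
# alongside its Running counterpart from the previous command.
oc get pods -n <namespace> -l component=predictor --field-selector=status.phase=Pending
```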
- Use the helper script to check for deadlocks (recommended):
  `./break-gpu-deadlock.sh --namespace <namespace> --check`

  To break this deadlock, you can run the `break-gpu-deadlock.sh` script, which deletes the Pending pod and re-creates it. This allows the LLM deployment to proceed instead of staying stuck in Pending.
- Fix deadlocks:
  `./break-gpu-deadlock.sh --namespace <namespace> --fix`
- Verify pods are no longer stuck:
  `oc get pods -n <namespace> -l component=predictor`

Backing up and restoring scheduled metrics is optional. Only do this if you have TrustyAIService instances (scheduled metrics live there).
Check with:

`oc get trustyaiservice -A`

If this returns `No resources found`, you can skip metrics backup/restore.
Run this once per namespace that has a TrustyAIService:
`./backup-metrics.sh -n <namespace>`

This writes a timestamped JSON file under `./backups/` and updates `./backups/trustyai-metrics-latest.json`.
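To sanity-check a backup before you upgrade, you can list the backup files and pretty-print the latest one with `jq` (the exact JSON layout depends on the script version, so treat this as illustrative):

```bash
# List the backups and pretty-print the most recent one.
ls -l backups/
jq . backups/trustyai-metrics-latest.json
```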
Restore from the latest backup:

`./restore-metrics.sh -n <namespace> -f backups/trustyai-metrics-latest.json`

Useful options:
- `--dry-run` to preview
- `--skip-existing` to avoid re-creating metrics that already exist
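For example, to preview a restore without creating anything:

```bash
# Dry run: shows what would be restored from the latest backup without applying it.
./restore-metrics.sh -n <namespace> -f backups/trustyai-metrics-latest.json --dry-run
```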
Note: restored metrics receive new UUIDs (original IDs are not preserved).