If you want, you can back up and restore TrustyAI scheduled metrics during operator upgrades (see Backup and restore metrics at the end).
Prereqs: `oc` (and `oc login`). `jq` is optional; it just makes JSON output easier to read.
This fix is relevant if you have GuardrailsOrchestrator instances and:
- your GuardrailsOrchestrator `/health` or `/info` endpoints return errors, or
- you enabled OpenTelemetry on RHOAI 2.25 and now the `spec.otelExporter` fields are not compatible with RHOAI 3.x.
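To check whether the endpoints are currently failing, you can query the orchestrator health route directly. This assumes the route is named `guardrails-orchestrator-health`, as in the verification step further down; adjust if yours differs:

```bash
# Look up the orchestrator health route and check the HTTP status of /health and /info.
ORCH_ROUTE_HEALTH=$(oc get routes -n <namespace> guardrails-orchestrator-health -o jsonpath='{.spec.host}')
curl -s -o /dev/null -w '/health -> %{http_code}\n' "https://$ORCH_ROUTE_HEALTH/health"
curl -s -o /dev/null -w '/info   -> %{http_code}\n' "https://$ORCH_ROUTE_HEALTH/info"
```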
List the GuardrailsOrchestrator instances in your namespace:

`oc get guardrailsorchestrator -n <namespace>`

If you don't know the namespace, list them across all namespaces:
`oc get guardrailsorchestrator -A`

If this returns `No resources found`, you can skip Fix 1.
Fix 1 patches the GuardrailsOrchestrator Deployment(s) to add the expected readiness probe on port 8034 at `/health`.
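You can confirm the probe before and after patching by inspecting the Deployment directly. A minimal sketch, assuming the orchestrator container is the first container in the pod spec and `<orchestrator-deployment>` is your Deployment name:

```bash
# Print the readiness probe of the first container; after the patch it should
# target port 8034 and path /health.
oc get deployment <orchestrator-deployment> -n <namespace> \
  -o jsonpath='{.spec.template.spec.containers[0].readinessProbe}{"\n"}'
```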
Run the script:

`./patch-guardrails-deployment.sh <namespace>`

Verify:
`ORCH_ROUTE_HEALTH=$(oc get routes -n <namespace> guardrails-orchestrator-health -o jsonpath='{.spec.host}')`

`curl -s https://$ORCH_ROUTE_HEALTH/info | jq`

If the route name differs, list routes in the namespace and pick the GuardrailsOrchestrator health route:
`oc get routes -n <namespace>`

If you enabled OpenTelemetry on RHOAI 2.25, the `spec.otelExporter` keys changed in RHOAI 3.x. The `migrate-gorch-otel-exporter.sh` script migrates the old fields to the new ones.
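To see which instances actually set `spec.otelExporter` (only those need migrating), a quick check like the following may help; the `jq` filter is illustrative:

```bash
# Print the names of GuardrailsOrchestrators that have spec.otelExporter set.
oc get guardrailsorchestrator -n <namespace> -o json \
  | jq -r '.items[] | select(.spec.otelExporter != null) | .metadata.name'
```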
- Check what would be migrated:
  `./migrate-gorch-otel-exporter.sh --namespace <namespace>`
  `./migrate-gorch-otel-exporter.sh --namespace <namespace> --dry-run`
- Apply the migration:
  `./migrate-gorch-otel-exporter.sh --namespace <namespace> --fix`
- Verify the migrated fields (you should see keys like `otlpProtocol`, `otlpTracesEndpoint`, `otlpMetricsEndpoint`, `enableTraces`, `enableMetrics`):
  `oc get guardrailsorchestrator -n <namespace> <name> -o jsonpath='{.spec.otelExporter}{"\n"}'`

A GPU scheduling deadlock can occur when an LLM deployment already exists and a TrustyAI service is then created in the same namespace. The LLM deployment then gets stuck in Pending.
- Look for an InferenceService predictor that has both a Running pod and a Pending pod:
  `oc get pods -n <namespace> -l component=predictor`
  Typical symptoms:
- Two pods for the same predictor
- One pod Running, one pod Pending
- Different container counts (e.g. `2/2` vs `0/3`)
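If you prefer to check by hand before running the helper script, filtering on pod phase narrows it down; a sketch using the same label selector as above:

```bash
# Show only Pending predictor pods; a deadlocked predictor appears here
# alongside its Running counterpart from the previous command.
oc get pods -n <namespace> -l component=predictor --field-selector=status.phase=Pending
```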
- Use the helper script to check for deadlocks (recommended):
  `./break-gpu-deadlock.sh --namespace <namespace> --check`

  To break this deadlock, you can run the `break-gpu-deadlock.sh` script, which deletes the Pending pod and re-creates it. This allows the LLM deployment to proceed instead of staying stuck in Pending.
- Fix deadlocks:
  `./break-gpu-deadlock.sh --namespace <namespace> --fix`
- Verify pods are no longer stuck:
  `oc get pods -n <namespace> -l component=predictor`

Backing up and restoring scheduled metrics is optional. Only do this if you have TrustyAIService instances (scheduled metrics live there).
Check with:

`oc get trustyaiservice -A`

If this returns `No resources found`, you can skip metrics backup/restore.
Run this once per namespace that has a TrustyAIService:
`./backup-metrics.sh -n <namespace>`

This writes a timestamped JSON file under `./backups/` and updates `./backups/trustyai-metrics-latest.json`.
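To sanity-check a backup before you upgrade, you can list the backup files and pretty-print the latest one with `jq` (the exact JSON layout depends on the script version, so treat this as illustrative):

```bash
# List the backups and pretty-print the most recent one.
ls -l backups/
jq . backups/trustyai-metrics-latest.json
```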
Restore from the latest backup:

`./restore-metrics.sh -n <namespace> -f backups/trustyai-metrics-latest.json`

Useful options:
- `--dry-run` to preview
- `--skip-existing` to avoid re-creating metrics that already exist
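For example, to preview a restore without creating anything:

```bash
# Dry run: shows what would be restored from the latest backup without applying it.
./restore-metrics.sh -n <namespace> -f backups/trustyai-metrics-latest.json --dry-run
```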
Note: restored metrics receive new UUIDs (original IDs are not preserved).